Code
library(ggplot2)
library(ggrepel)
library(ggforce)
library(pheatmap)
library(UpSetR)
library(viridis)
library(rstatix)
library(Cairo)Alternative splicing (AS) events included in this analysis:
alternatively spliced exons and microexons
intron retention
alternative splice sites (acceptor at 3’ and donor at 5’)
For each sample AS events were quantified using the percentage of sequence inclusion (PSI) metric, which is a number ranging from zero to 100, corresponding to full sequence skipping or full sequence inclusion in all transcripts of a gene, respectively. For example, constitutively spliced exons have a PSI of 100. The PSI cannot be calculated for the very first and last exons of a gene.
These documents reports the RNA-seq downstream AS analysis after processing the samples with VAST-TOOLS(Tapial et al. 2017) for SRRM2 Knockdown (KD) in HeLa(Zhang et al. 2024) (Bo Wang lab) and HepG2(Cui et al. 2023) (Ben Blewncowe, Alan Lambowitz & Paul Schimmel) cells.
Last code execution: 2025 March 03, Monday @ 17:03:52.
vast-tools(Tapial et al. 2017) was used to quantify both AS events PSI and gene expression as normalised TPMs using limma::normalizeBetweenArrays and cRPKMs per gene. The analysis was performed using vast-tool v2.5.1.
The raw FASTQ files were processed as following:
vast-tools maps the RNA-seq reads to a predefined set of exon-exon junctions (EEJ). This set of EEJ is an “enriched” version of all ENSEMBL transcripts. This dataset was analysed using Homo sapiens assembly hg38, based on ENSEMBL v88. To do so vast-tools align first trims the reads to 50 nucleotides fragments with a sliding window define by --stepSize (usually 25 or 20 nt). The trimmed reads are aligned to the EEJ libraries using bowtie v1.2. The trimmed reads must have at least 8 nucleotides overlapping the splice site and be uniquely mapping to be considered for quantification.
For this analysis fastq files were processed with vast-tools align with options --sp hg38, ---IR_version 2, --stepSize 20, and --expr. Reads strand detection is done automatically.
To increase sequencing read depth the vast-tools has a module called merge that pools the individual replicates of each condition into a super sample. This is basically equivalent to concatenating FASTQ files with cat before running align. This helps prevent biases in AS events identification towards highly expressed genes. For this analysis the 3 individual replicates per condition were merged.
To build the main output containing all detected AS events the vast-tools combine module takes intermediate output files and generate a table named as INCLUSION_LEVELS_FULL-<SPECIES_ASSEMBLY>-<SAMPLES_NUM>-v251.tab. This file contains an AS event on each row, the first 6 columns contain basic information about the event (e.g. gene name, length, and coordinates) followed by a column with the PSI and a quality score column for each sample. Further details about the comma separated quality scores can be found here. This step also generates a similar table gene expression counts expressed as cRPKM or TPM.
All individual replicates and the merged super samples were combined with vast-tools combine with options --add_version, --TPM and --norm for gene expression. For this analysis the output file has 721551 AS events.
To calculate the differential inclusion level (∆PSI) between samples the combined table was processed with vast-tools compare using the knock down control samples (-a) and the following filtering parameters: --min_dPSI 15, --min_range 0, --max_dPSI 5, --print_sets, --noB3, --p_IR, --print_dPSI, and --GO. The HeLa and HepG2 knock downs were each compared to the respective controls separately. The comparison was done using the 3 individual replicates per condition or by using only the merged super sample. This is denoted in the file names as _uniq_ or _mrgd_.
To simplify and filter the full inclusion table the module vast-tools tidy produces a table with only events that have a minimum coverage that are more likely to relevant for downstream analysis.
Full inclusion table was filtered using vast-tools tidy with: --noB3, --p_IR, --add_names, and --log. The tidy module has an option to discard the VLOW events (see below). Here since there are replicates the VLOW events are include in the analysis.
Lastly, a statistical differential splicing analysis was performed with vast-tools diff using Bayesian inference. Briefly , it calculates PSI posterior distributions using Bayes’ theorem and employs replicates to estimate joint posterior distributions. The statistical significance between two posterior distributions is determined by comparing their differences using empirical distributions (Han et al. 2017) (Weatheritt, Sterne-Weiler, and Blencowe 2016).
The vast-tools diff module was run with options: -n 10000 (lines to process in parallel), -m 0.15 (min ∆PSI/100) -e 10 (minimum reads), -z 16 (random seed), and --noPDF (do not plot the estimates).
I use 3 different methods to select differentially spliced events (DSEs).
The compare module pre-filters the events based on read coverage imbalance and other features. The events are then simply filtered by calculating average and individual ∆PSIs. Between control and test samples the PSI distribution must be non overlapping, i.e. at least 10 PSI difference between the maximum and minimum PSI of the 2 groups (--min_range 10).
compare is a non-statistical approach
Differentially spliced events have a minimum |∆PSI| >= 15. This might have the caveat of missing AS events in lowly expressed genes.
Using the inclusion table generated by the tidy module, the events were further filtered to have coverage in at least 2 samples per condition and statistical difference between the average PSI in each condition was calculate using the Student’s t-test followed by multiple hypothesis testing (BH) correction using rstatix::t_test in R.
Events with a P-value <= 0.01 and a |∆PSI| >= 10 were considered significant. The non parametric Mann-Whitney (rstatix::wilcox_test) was also used to check if not assuming normality impacts the results drastically in a separate appendix.
The module diff uses:
A uniform prior distribution Beta(α = 1, β = 1).
A binomial likelihood function for the number of inclusion reads K ~ Binomial(Ψ, N).
where Ψ represents PSI for any AS event and N is the total number of junction reads per event. The formula K ~ Binomial(Ψ, N) means that the number of inclusion reads (K) is assumed to follow a binomial distribution with Ψ as the probability of inclusion (i.e., success) and N as the total number of junction reads per splicing event.
This prior distribution represents our initial beliefs about the PSI before observing any data, while the likelihood function represents the probability of observing the data given a specific PSI value. The likelihood function follows a binomial distribution, which is appropriate for count data like the number of inclusion reads.
According to Bayes’ theorem, our prior beliefs about PSI are updated based on observed data to obtain the posterior distribution. In this case, the prior distribution (uniform Beta) and the likelihood function (binomial) are “conjugate” to each other, which means that the posterior distribution can be expressed in a closed form using the same type of distribution as the prior. Therefore the posterior distribution over Ψ, is represented as a Beta distribution with parameters Ψ ≈ Beta(K + α, (N – K) + β).
If replicates are available, joint posterior distributions for a sample are estimated from sampling empirical posterior distributions of the replicates and fitting a new posterior Beta with maximum likelihood estimation (MLE) using the MASS::fitdist function in R. This basically means combining the information from multiple replicates to get a more robust estimate of the PSI. The statistical significance between two biological conditions, defined as the level of confidence in determining whether two posterior PSI distributions are significantly different from each other, modelled as two posterior distributions X~ Beta, and Y ~ Beta, is calculated as P(X – Y > 0), i.e., X is higher than Y. This probability can be estimated from the difference of empirical distributions sampled between X and Y such that P(X – Y > 0) =\(\sum_{i=1}^{n} (Xi –Yi > 0) / N\). Lastly, significantly differential events were additionally required to have a PSI difference > 10.
Using the diff module of vast-tools filter for events with a Minimum Value (MV) of |∆PSI| at 95% confidence >= 15%. This means that each event as a 95% probability to have a |∆PSI| between the 2 conditions higher or equal to the minimum value, which is 15% ∆PSI.
Load common packages that have to be pre-installed. Check Section 10 for package versions.
library(ggplot2)
library(ggrepel)
library(ggforce)
library(pheatmap)
library(UpSetR)
library(viridis)
library(rstatix)
library(Cairo)In addition, I developed an R package called niar to stream line some of the common analysis steps. It can be installed from my GitHub repository using:
devtools::install_github("Ni-Ar/niar")
library(niar)Here I load the package from my local repository with:
local <- FALSE
if(local) {
devtools::load_all(path = '~/mnt/narecco/software/R/niar')
} else if (local == FALSE) {
devtools::load_all(path = '~/software/R/niar')
} else {
stop('local is not a Logical')
}Colour palettes:
Events coverage quality score colour palette.
quality_score_colors <- c('#ffffcc', '#c2e699', '#78c679', '#31a354', '#006837')
quality_score_values <- c("N", "VLOW", "LOW", "OK", "SOK")
names(quality_score_colors) <- quality_score_valuesSRRM2-KD palette
sample_palette <- c('#ffb703', '#219ebc')
names(sample_palette) <- c('HeLa', 'HepG2')Some generic functions I wrote for this analysis:
A plotting function to explore sample’s ∆PSI of exons and introns
#' Group differentially spliced exons and introns
#'
#' @param data A data frame with the columns `dPSI`, `COMPLEX`, `comparison`
#'
#' @return A ggplot
plot_dPSI_hist_grpd <- function(data, xlim = c(-60, 60) ) {
exon_type <- c("S", "C1", "C2", "C3", "ANN", "MIC")
intron_type <- c("IR")
alt_ss <- c("Alt3", "Alt5")
data |>
select(dPSI, COMPLEX, comparison) |>
mutate(TYPE = case_when(COMPLEX %in% exon_type ~ 'Exon',
COMPLEX %in% intron_type ~ 'Intron',
COMPLEX %in% alt_ss ~ 'Alt. s.s.',
) ) |>
# subset( !COMPLEX %in% c('Alt3', 'Alt5') ) |>
mutate(TYPE = factor(TYPE, levels = c('Exon', 'Intron', 'Alt. s.s.'))) |>
unique() |>
ggplot() +
aes(x = dPSI, fill = dPSI > 0 ) +
facet_grid(TYPE ~ comparison, scales = 'free_x',
labeller = labeller(comparison = c(HeLa_SRRM2KD_mrgd = 'HeLa SRRM2 KD',
HepG2_SRRM2KD_mrgd = 'HepG2 SRRM2 KD') ) ) +
geom_histogram(binwidth = 1, show.legend = F) +
scale_x_continuous(n.breaks = 10) +
scale_y_continuous(n.breaks = 4, expand = expansion(mult = c(0, 0.01))) +
scale_fill_manual(values = c("TRUE" = "firebrick3", "FALSE" = "dodgerblue3")) +
labs(x = "\u0394PSI (KD - Cntrl)", y = "Num. of events") +
coord_cartesian(xlim = xlim) +
theme_classic(base_family = 'Arial', base_size = 6) +
theme(legend.position = 'none',
plot.background = element_blank(),
panel.background = element_blank(),
panel.grid.major = element_line(colour = 'gray84', linewidth = 0.1),
panel.grid.minor.y = element_blank(),
panel.border = element_blank(),
axis.text = element_text(colour = 'black'),
strip.background = element_blank() ) -> cmpr_dPSI_Hist
return(cmpr_dPSI_Hist)
}QC plot with the number of events within each quality score
#' Show the number of events by quality score in a stacked bar plot
#'
#' @param data A dataframe with the complex `COMPLE` and `Quality_Score_Value`
#'
#' @return A ggplot
plot_quality_score_stacked <- function(data) {
ggplot(data) +
aes(x = COMPLEX, fill = Quality_Score_Value) +
geom_bar(colour = 'black', linewidth = 0.2) +
scale_fill_manual(values = quality_score_colors, name = "Score") +
scale_y_continuous(expand = expansion(mult = 0, add = 1), n.breaks = 10) +
labs(x = 'Type of AS event', y = 'Number of AS event') +
theme_classic(base_family = 'Arial', base_size = 6) +
theme(legend.position = 'inside',
legend.position.inside = c(0.45, 0.95),
legend.key.size = unit(2, units = 'mm'),
legend.direction = 'horizontal',
plot.background = element_blank(),
panel.background = element_blank(),
panel.grid.major.y = element_line(linewidth = 0.2, colour = 'grey84'),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.border = element_blank(),
axis.text = element_text(colour = 'black'),
axis.ticks = element_line(linewidth = 0.2),
axis.ticks.x = element_blank(),
axis.line = element_line(linewidth = 0.2)) -> cmpr_Scores_Bars
return(cmpr_Scores_Bars)
}Wrapper to explore the PSI and quality scores of specific events.
plot_PSI_boxplot <- function(df, num_col = 5, order_by_dPSI = TRUE, legend_xy = c(0.93, 0.1) ) {
if (order_by_dPSI) {
df <- df |>
mutate(EVENT = fct_reorder(EVENT, dPSI))
gene_order <- unique( df[order(df$EVENT, decreasing = T), ]$GENE)
df <- df |>
mutate(GENE = factor(GENE, levels = gene_order))
}
ggplot(df) +
aes(x = PrettyGroup, y = PSI) +
facet_wrap( ~ GENE + EVENT, scales = "free_x", ncol = num_col) +
geom_boxplot(fill = NA, outlier.shape = NA, lwd = 0.2) +
geom_point( aes(fill = Quality_Score_Value, shape = Replicate),
size = 1.5, stroke = 0.2, position = position_dodge(width = 0.5)) +
scale_shape_manual(values = c('Merge' = 22, 'A' = 21, 'Aa' = 21, 'Ab' = 21,
'B' = 21, 'Ba' = 21, 'Bb' = 21,
'C' = 21, 'Ca' = 21, 'Cb' = 21 ) ) +
scale_fill_manual(values = quality_score_colors, name = "Quality\nScore") +
scale_x_discrete(labels = \(x) gsub(pattern = '_', replacement = '\n', x) ) +
scale_y_continuous(breaks = seq(0, 100, 10),
expand = expansion(mult = 0, add = 2)) +
guides(fill = guide_legend(override.aes = list(shape = 21)),
shape = 'none') +
coord_cartesian(ylim = c(0, 100), clip = 'off') +
labs(y = 'PSI') +
theme_classic() +
theme(axis.text = element_text(colour = "black"),
axis.line = element_line(linewidth = 0.2),
axis.title.x = element_blank(),
strip.background = element_blank(),
strip.text = element_text(vjust = 1, lineheight = 0,
margin = margin(t = 0, unit = 'mm')),
panel.grid.major = element_line(colour = 'gray84', linewidth = 0.15),
legend.position = 'inside',
legend.position.inside = legend_xy )
}Calculate z-score on a numeric matrix
zScore <- function(mat) {
row_means <- apply(mat, 1, mean)
row_sd <- apply(mat, 1, sd)
# handle degenerate distribution in which the denominator (row sd) of the z-score formula is zero.
# the z-score becomes undefined as the denominator of the z-score formula is zero.
# this is a rare situation in general. Here I fix it by diving by 1 instead.
if( any(row_sd == 0) ) { row_sd[which(row_sd == 0)] <- 1 }
zScore_mat <- (mat - row_means) / row_sd
return(zScore_mat)
}Alternative formula to z-score that used median
zMad <- function(mat) {
row_medians <- apply(mat, 1, median)
row_median_abs_dev <- abs(mat - row_medians)
median_row_median_abs_dev <- apply(row_median_abs_dev, 1, median)
# conceptually row_mad is the sd. 1.482 is a constant linked to the assumption of normality of the data
row_mad <- median_row_median_abs_dev * 1.4826
# If row max is zero fix it by diving by 1 instead.
if( any(row_mad == 0) ) { row_mad[which(row_mad == 0)] <- 1 }
zMad_dev <- (mat - row_medians) / row_mad
return(zMad_dev)
}
# mat <- head(tidy_full_mat)
# zMad(mat)
# zScore(mat)Function I wrote to make sure alternative donors and acceptors are not repeated between up and down regulated sets (used in tidy analysis).
#' Match Alternative 3' or 5' Splice Site IDs
#'
#' This function takes two sets of upregulated and downregulated alternative 3' or 5' splice site IDs
#' (VAST-TOOLS format) and resolves them by ensuring the IDs are correctly matched, taking into account
#' variations in the alternative splicing.
#'
#' @param up_ids A character vector of upregulated splice site IDs (VAST-TOOLS format).
#' @param do_ids A character vector of downregulated splice site IDs (VAST-TOOLS format).
#' @return A character vector of resolved and unique splice site IDs.
#' @examples
#' up_ids <- c("HsaALTA0001463-1/2", "HsaALTA0006052-1/2", "HsaALTA0006256-2/2",
#' "HsaALTA1001090-3/3", "HsaALTA1015934-2/4")
#' do_ids <- c("HsaALTA0001463-2/2", "HsaALTA0001572-1/5", "HsaALTA0006052-2/2",
#' "HsaALTA0006256-1/2", "HsaALTA0007106-4/5", "HsaALTA0008669-2/4", "HsaALTA1001090-2/3")
#' match_alt35_id(up_ids, do_ids)
#' # Returns: "HsaALTA0001463-1/2" "HsaALTA0006052-1/2" "HsaALTA0006256-2/2"
#' # "HsaALTA1001090-3/3" "HsaALTA1015934-2/4" "HsaALTA0001572-1/5"
#' # "HsaALTA0007106-4/5" "HsaALTA0008669-2/4"
match_alt35_id <- function(up_ids, do_ids) {
# Calculate the number of unique IDs combining both upregulated and downregulated sets
all_up_do_ids_len <- length(unique(c(up_ids, do_ids)))
# Remove specific splice variation suffix (-x/y) and then calculate the number of unique IDs
sam_up_do_ids_len <- length(unique(c(str_remove(string = up_ids, pattern = '\\-[1-9]/[1-9]'),
str_remove(string = do_ids, pattern = '\\-[1-9]/[1-9]'))))
# Case 1: If the number of unique IDs remains the same after removing variations, return the unique set
if (all_up_do_ids_len == sam_up_do_ids_len) {
return(unique(c(up_ids, do_ids)))
}
# Case 2: If the number of unique IDs differs, identify and filter specific splice IDs
else if (all_up_do_ids_len > sam_up_do_ids_len) {
# Check if the downregulated set has more or equal IDs compared to the upregulated set
if (length(do_ids) >= length(up_ids)) {
# Identify and retain downregulated splice IDs that do not match upregulated IDs without the suffix
indx_do_ids <- which(!str_remove(string = do_ids, pattern = '\\-[1-9]/[1-9]') %in%
str_remove(string = up_ids, pattern = '\\-[1-9]/[1-9]'))
do_ids <- do_ids[indx_do_ids]
return(unique(c(up_ids, do_ids)))
}
# Check if the upregulated set has more or equal IDs compared to the downregulated set
if (length(up_ids) >= length(do_ids)) {
# Identify and retain upregulated splice IDs that do not match downregulated IDs without the suffix
indx_up_ids <- which(!str_remove(string = up_ids, pattern = '\\-[1-9]/[1-9]') %in%
str_remove(string = do_ids, pattern = '\\-[1-9]/[1-9]'))
up_ids <- up_ids[indx_up_ids]
return(unique(c(up_ids, do_ids)))
}
}
}Files and directories paths. All files are backed-up on the CRG cluster.
if(local) {
proj_dir <- file.path('~/mnt/narecco/projects/01_ALTdemix')
} else if (local == FALSE) {
proj_dir <- file.path('~/projects/01_ALTdemix')
} else {
stop('local is not a Logical')
}
proj_dir <- normalizePath(proj_dir)
expr_dir <- file.path(proj_dir, 'data/INCLUSION_tbl/Tanja')
vast_tools_dir <- file.path(expr_dir, 'vast_tools')
vast_out <- file.path(expr_dir, 'vast_tools/vast_out')
psi_path <- file.path(vast_out, 'INCLUSION_LEVELS_FULL-hg38-34-v251.tab')
expr_path <- file.path(vast_out, 'TPM-hg38-34-NORM.tab')
cmpr_dir <- file.path(vast_out, 'compare_2024_03_26/min_dPSI15_min_range0_max_dPSI5')
cmpr_HeLa_SRRM2_uniq_path <- file.path(cmpr_dir, 'HeLa_SRRM2KD_vs_CNTRL_uniq_noB3_pIR.tab')
cmpr_HepG2_SRRM2_uniq_path <- file.path(cmpr_dir, 'HepG2_SRRM2KD_vs_CNTRL_uniq_noB3_pIR.tab')
cmpr_HeLa_SRRM2_mrgd_path <- file.path(cmpr_dir, 'HeLa_SRRM2KD_vs_CNTRL_mrgd_noB3_pIR.tab')
cmpr_HepG2_SRRM2_mrgd_path <- file.path(cmpr_dir, 'HepG2_SRRM2_vs_CNTRL_mrgd_noB3_pIR.tab')
tidy_dir <- file.path(vast_out, 'tidy_2024_03_25')
tidy_psi_path <- file.path(tidy_dir, 'INCLUSION_LEVELS_TIDY_HeLa_HepG2_noB3_pIR.tab')
diff_dir <- file.path(vast_out, 'diff_2024_03_25/min_dPSI_min_reads10')
diff_HeLa_SRRM2_path <- file.path(diff_dir, 'HeLa_SRRM2KD_minRead10_noB3_pIR_fltrd_MVdPSI15.tab')
diff_HepG2_SRRM2_path <- file.path(diff_dir, 'HepG2_SRRM2KD_minRead10_noB3_pIR_fltrd_MVdPSI15.tab')
plot_dir <- file.path(expr_dir, 'plots/SRRM2')
if ( !dir.exists(plot_dir) ) { dir.create(path = plot_dir, recursive = T) }
dse_IDs_dir <- file.path(expr_dir, 'diff_spliced_IDs/SRRM2-KD')
if ( !dir.exists(dse_IDs_dir) ) { dir.create(path = dse_IDs_dir, recursive = T) }Select the differentially spliced events based on the 3 different methods described above. This is the core part of this report.
Import all the tables from the vast-tools compare analysis in a list.
list.files(cmpr_dir, full.names = T, pattern = '^He.*SRRM2.*_noB3_pIR.tab$') |>
lapply(function(x) {
read_vst_tbl(path = x, show_col_types = FALSE) |>
tidy_vst_psi() |>
mutate(comparison = gsub(pattern = '_noB3_pIR.tab', replacement = '', x) ) |>
mutate(comparison = gsub(pattern = paste0(cmpr_dir, '/'), replacement = '', comparison) ) |>
mutate(comparison = gsub(pattern = "vs_CNTRL_", replacement = '', comparison) ) |>
mutate(CellType = str_extract(pattern = "^(HepG2|HeLa)", string = Sample) )
# mutate(comparison = gsub(pattern = "(HepG2|HeLa)_", replacement = '', comparison) ) |>
}) -> list_comparesBind the list of data frames into one single data frame.
cmpr_events <- do.call('rbind', list_compares) Create a metadata data frame from the samples’ names.
cmpr_events |>
select(Sample, CellType, comparison) |>
unique() |>
mutate(KD = gsub(pattern = 'KD_.*$', replacement = '', comparison ) ) |>
mutate(KD = str_remove(string = KD, pattern = '(HepG2|HeLa)_') ) |>
mutate(Replicate = str_split_fixed(string = Sample, pattern = "_", n = 3)[,3]) |>
mutate(Replicate = ifelse(Replicate == '', yes = 'Merge', no = Replicate ) ) |>
mutate(PrettySample = str_remove(string = Sample, pattern = '(HepG2|HeLa)_') ) |>
mutate(PrettyGroup = str_remove(string = Sample, pattern = '_[aA-cC].$'),
PrettyGroup = str_remove(string = PrettyGroup, pattern = '_(A|B|C)$') ) |>
mutate(PrettyGroup = factor(PrettyGroup,
levels = c("HeLa_CNTRL", "HeLa_SON", "HeLa_SRRM2",
"HepG2_CNTRL", "HepG2_MARS",
"HepG2_RARS", "HepG2_SRRM2") ) ) |>
relocate(PrettySample, PrettyGroup, .after = Sample) |>
# drop comparison column
select( !c(comparison) ) |>
unique() -> mtdfAdd metadata info to the compared events table.
cmpr_events <- left_join(x = cmpr_events, y = mtdf,
by = join_by(Sample, CellType) ) Not all AS events are easily measurable and not all events have the same sequencing coverage.
The AS events quantified by vast-tools are divided in different subgroups based on their complexity, which is defined in the 6th column (COMPLEX) of the compare table. This complexity resemble the underlying exon-intron structures in the gene body. The complexity of an AS exon was established based on Score 5 from the quality column of the inclusion table. It summaries the number of reads that come from the “reference” C1-A, A-C2 and C1-C2 junctions for a wide panel of RNA-seq samples (see Irimia et al. 2014 for further information). The complexity score is the same for all the samples for any given event.
Namely:
S, C1, C2, C3 are types of exon skipping events quantified by the splice site-based or transcript-based modules, with increasing degrees of complexity (e.g. S = simple; C3 more complex than C2).ANN are exon skipping events quantified by the ANNOTATION module. Their IDs start with 6 or 7 (e.g. HsaEX6000001 or HsaEX7000001). These are exons from the reference GTF annotation used to build VAST-TOOLS. However, some annotated events are not present, because of low mappability or reads imbalance.MIC are exon skipping events quantified by the microexon pipeline.IR are intron retention event.Alt3 are alternative acceptor events at the 3’ splice site of an exon.Alt5 are alternative donor events at the 5’ splice site of an exon.In addition, an AS event can be measured at different sequencing depths. In vast-tool this is resembled by Score 1 of the quality column. Each event in each samples is given a score based on the actual read counts, before mappability correction. This coverage scores are: N, VLOW, LOW, OK, and SOK, (where N = Not meeting minimum threshold, VLOW = Very low and SOK = Super okay). In general anything above N can be considered reliable and worth considering, but VLOW events might have an actual different PSI value if confirmed experimentally with a PCR on RT-cDNA.
Here I check if the quality score of the events changes when including or excluding the super sample.
ggplot(cmpr_events) +
aes(x = Quality_Score_Value, fill = comparison) +
facet_grid(~COMPLEX) +
geom_bar(position = position_dodge(), width = 0.75, colour = 'black', linewidth = 0.1) +
scale_fill_manual(values = c('#FFCC33', 'darkgoldenrod', '#009990', 'green4'),
name = 'compared against control') +
scale_y_continuous(expand = expansion(add = c(0, 1)), n.breaks = 10) +
labs(x = 'AS event quantification quality score',
y = 'Number of AS events') +
theme_bw(base_family = 'Arial') +
theme(axis.text = element_text(colour = 'black'),
axis.text.x = element_text(angle = 45, hjust = 1),
axis.ticks.x = element_blank(),
strip.background = element_blank(),
plot.background = element_blank(),
legend.position = 'bottom',
legend.key.size = unit(x = 3, units = 'mm'),
legend.background = element_blank(),
legend.margin = margin(t = 0, unit = 'mm'),
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.major.y = element_line(linewidth = 0.1, colour = 'gray84'),
panel.background = element_blank(),
panel.border = element_rect(linewidth = 0.2, colour = 'black')) -> p_QC_events_by_type
p_QC_events_by_typeAs one can see there’s a gain in lower quality events if including the merged replicates super samples. This behaviour is kind of expected.
ggsave(filename = '01_QC_AS_Events_by_Type_Coverage_Barplot.pdf', plot = p_QC_events_by_type,
device = cairo_pdf, path = plot_dir, units = 'cm', width = 30,
height = 10)When filtering out for N events…
fltrd_cmpr_events <- subset(cmpr_events, as.integer(Quality_Score_Value) >= 2)…and repeating the plot.
ggplot(fltrd_cmpr_events) +
aes(x = Quality_Score_Value, fill = comparison) +
facet_grid(~COMPLEX) +
geom_bar(position = position_dodge(), width = 0.75, colour = 'black', linewidth = 0.1) +
scale_fill_manual(values = c('#FFCC33', 'darkgoldenrod', '#009990', 'green4'),
name = 'compared against control') +
scale_y_continuous(expand = expansion(add = c(0, 1)), n.breaks = 10) +
labs(x = 'AS event quantification quality score',
y = 'Number of AS events') +
theme_bw(base_family = 'Arial') +
theme(axis.text = element_text(colour = 'black'),
axis.text.x = element_text(angle = 45, hjust = 1),
axis.ticks.x = element_blank(),
strip.background = element_blank(),
plot.background = element_blank(),
legend.position = 'bottom',
legend.key.size = unit(x = 3, units = 'mm'),
legend.background = element_blank(),
legend.margin = margin(t = 0, unit = 'mm'),
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.major.y = element_line(linewidth = 0.1, colour = 'gray84'),
panel.background = element_blank(),
panel.border = element_rect(linewidth = 0.2, colour = 'black')) -> p_fltrd_QC_events_by_type
p_fltrd_QC_events_by_typeThe number of DSEs is usally always higher for the mrgd super samples comparison. There seems to be a good accordance in the number of DSE in HeLa and HepG2 cells. Also there are basically no microexon events differentially spliced.
ggsave(filename = '02_QC_AS_Events_No_N_by_Type_Filtered_by_Coverage_Barplot.pdf',
plot = p_fltrd_QC_events_by_type, device = cairo_pdf,
path = plot_dir, units = 'cm', width = 30, height = 10)Start by importing the tidy data set. Here “tidy” means a subset of all identified AS events that have generally a good score (i.e. coverage) and balanced read counts on both splice sites.
tidy_events <- read_vst_tbl(path = tidy_psi_path, show_col_types = FALSE) |>
mutate(GENE = str_remove(pattern = '=Hsa.*$', string = EVENT) ) |>
mutate(EVENT = str_remove(pattern = '^*.*=', string = EVENT) ) |>
relocate(GENE, .before = EVENT) |>
pivot_longer(cols = starts_with('He'), names_to = 'Sample', values_to = 'PSI')
message('Tidy events: ', length(unique(tidy_events$EVENT)))Tidy events: 34490
tidy_events_backup <- tidy_eventsSelect only the samples from the SRRM2 KD experiment
tidy_events <- tidy_events[grepl(pattern = '(CNTRL|SRRM2)', x = tidy_events$Sample, perl = T), ]First let’s use these events to check the dispersion of the samples with a PCA. First turn the PSI dataframe into a matrix
tidy_events |>
select(GENE, EVENT, Sample, PSI) |>
pivot_wider(id_cols = EVENT, names_from = Sample, values_from = PSI) |>
column_to_rownames('EVENT') |>
as.matrix() -> tidy_matPlot the PCA using the function showme_PCA2D form my R package and save it.
showme_PCA2D(log2(tidy_mat), mt = mtdf, mcol = 'Sample',
real_aspect_ratio = F, m_fill = 'CellType', m_label = 'Sample',
n_top_var = 1000)ggsave(filename = '10_tidy_PCA_HeLa_HepG2.pdf', plot = last_plot(),
device = cairo_pdf, path = plot_dir, units = 'cm', width = 15,
height = 10)Warning: ggrepel: 5 unlabeled data points (too many overlaps). Consider
increasing max.overlaps
The first principal component separates between the 2 cell lines and probably there’s a big influence in batch effect between the 2 different experiments.
showme_PCA2D(log2(tidy_mat), mt = mtdf, mcol = 'Sample', x = 2,
real_aspect_ratio = F, m_fill = 'CellType',
m_label = 'Sample', n_top_var = 1000)The second and third components start separating the data based on the biological KD, but the variance explained in the data is minimal.
It’s probably better to analyse the HeLa cells separately from the HepG2 to focus more on true biological effect.
tidy_events_HeLa <- tidy_events_backup[grepl(pattern = 'HeLa', x = tidy_events_backup$Sample, perl = T), ]
tidy_events_HeLa |>
select(GENE, EVENT, Sample, PSI) |>
pivot_wider(id_cols = EVENT, names_from = Sample, values_from = PSI) |>
column_to_rownames('EVENT') |>
as.matrix() -> tidy_mat_HeLa
showme_PCA2D(mat = log2(tidy_mat_HeLa), mt = mtdf, mcol = 'Sample', real_aspect_ratio = F,
m_fill = 'PrettyGroup', m_label = 'PrettySample', n_top_var = 1000)ggsave(filename = '11_tidy_PCA_HeLa.pdf', plot = last_plot(),
device = cairo_pdf, path = plot_dir, units = 'cm', width = 15,
height = 10)The PC1 is separating the SON KD down-samples while PC2 is separating SRRM2.
tidy_events_HepG2 <- tidy_events_backup[grepl(pattern = 'HepG2', x = tidy_events_backup$Sample, perl = T), ]
tidy_events_HepG2 |>
select(GENE, EVENT, Sample, PSI) |>
pivot_wider(id_cols = EVENT, names_from = Sample, values_from = PSI) |>
column_to_rownames('EVENT') |>
as.matrix() -> tidy_mat_HepG2
showme_PCA2D(mat = log2(tidy_mat_HepG2), mt = mtdf, mcol = 'Sample', real_aspect_ratio = F,
m_fill = 'PrettyGroup', m_label = 'PrettySample', n_top_var = 1000)ggsave(filename = '12_tidy_PCA_HepG2.pdf', plot = last_plot(),
device = cairo_pdf, path = plot_dir, units = 'cm', width = 15,
height = 10)Here, PC1 is separating a bit of everything, and again PC2 is splitting SRRM2 vs MARS KD. In the HepG2 dataset the samples are a bit more spread out, with replicates Ca and Cb being a bit of an outlier.
Explore the samples with a heatmap of the AS events. First let’s check how many events are missing the PSI (e.g. not enough coverage). In the next heatmap the colour white represent NA values and black represent a quantification (from 0 to 100). Of course the merged super samples have more coverage. I don’t think there’s a specific patter and I’d say that the values are missing at random.
cairo_pdf(file = file.path(plot_dir, '13_tidy_missing_values_heatmap.pdf'),
width = 5.5, height = 8, bg = NA)
pheatmap(tidy_mat, na_col = 'white', cluster_rows = F, cluster_cols = T,
legend = F, show_rownames = F, color = 'black',
main = paste0( nrow(tidy_mat), ' AS events' ) )dev.off()cairo_pdf
3
One can see that the HepG2 samples are probably more deeply sequenced and have less missing values.
Since the the super samples have more coverage let’s plot only the AS events in those 3 columns. NA are again displayed in white.
mat <- tidy_mat[, c('HeLa_CNTRL', 'HeLa_SRRM2', 'HepG2_CNTRL', 'HepG2_SRRM2')]
pheatmap(mat = mat, na_col = 'white', cluster_rows = F, cluster_cols = T,
main = paste0( nrow(mat), ' AS events'), angle_col = 90,
show_rownames = F, color = viridis(n = 100), cutree_cols = 4)rm(mat)The trend is pretty much the same, but HepG2 have more coverage than HeLa cells.
To detect subtle changes with a more statistical framework I try to calculate a P value using the individual replicates PSI and plot them against the ∆PSI as a Volcano plot. To achieve this first define the sample types.
mrgd_cntrl <- "HeLa_CNTRL"
mrgd_SRRM2 <- "HeLa_SRRM2"
CNTRLKD <- c("HeLa_CNTRL_A", "HeLa_CNTRL_B", "HeLa_CNTRL_C")
SRRM2KD <- c("HeLa_SRRM2_A", "HeLa_SRRM2_B", "HeLa_SRRM2_C")Filter for samples that have finite ∆PSI (meaning that is not just NA in one of the mrgd samples) and that have maximum one sample per condition as NA (meaning at least 2 samples are not NA). This is step is done only on the SRRM2 KD in HeLa cells.
te_HeLa_SRRM2 <- tidy_events_HeLa[grepl(pattern = '(CNTRL|SRRM2)', x = tidy_events_HeLa$Sample, perl = T), ]Filter the data.
te_HeLa_SRRM2 |>
group_by(EVENT) |>
summarise(mPSI_CNTRL = mean(PSI[Sample %in% mrgd_cntrl], na.rm = T),
mPSI_SRRM2 = mean(PSI[Sample %in% mrgd_SRRM2], na.rm = T),
Samples_NA_CNTRL = sum(is.na(PSI[Sample %in% CNTRLKD])),
Samples_NA_SRRM2 = sum(is.na(PSI[Sample %in% SRRM2KD])) ) |>
# calculate ∆PSI
mutate(dPSI_SRRM2 = mPSI_SRRM2 - mPSI_CNTRL ) |>
subset( is.finite(dPSI_SRRM2) ) |>
# select samples with max 1 NA PSI value per event
# i.e. at least 2 samples with a finite PSI
subset( Samples_NA_CNTRL <= 1) |>
subset( Samples_NA_SRRM2 <= 1) |>
select(EVENT, dPSI_SRRM2) |>
unique() |>
# Add the ∆PSI of the filtered events to the original data frame
left_join(y = te_HeLa_SRRM2, by = join_by(EVENT)) |>
relocate(GENE, .before = EVENT) |>
relocate(starts_with('dPSI'), .after = PSI) |>
# add metadata information
left_join(y = mtdf, by = join_by(Sample)) |>
# remove merged super sample PSIs
subset( !Replicate == 'Merge') |>
group_by(EVENT) |>
# for the statistical test select only events that are changing a bit to enter the test
# if not the data is essentially constant and statistical test fails
# this is calculated by measuring the PSI difference standard deviation that should not be zero
mutate(dSD_SRRM2 = sd(PSI[Sample %in% CNTRLKD] - PSI[Sample %in% SRRM2KD],
na.rm = T) ) |>
subset(dSD_SRRM2 != 0) |>
select( !c(starts_with("dSD_"), PrettySample, KD, CellType)) |> # remove these columns
# In addition one could also select the events that have a ∆PSI different from zero.
# subset( abs(dPSI_SRRM2) >= 0.01 ) |>
ungroup() -> tidy_dPSI_HeLa
message(length(unique(tidy_dPSI_HeLa$EVENT)), '/',
length(unique(tidy_events$EVENT)), ' (',
100*signif(length(unique(tidy_dPSI_HeLa$EVENT))/length(unique(tidy_events$EVENT)), 2), '%) ',
'filtered events that enter the statistical analysis')10821/34490 (31%) filtered events that enter the statistical analysis
The following analysis uses Student’s t-test on PSI values that are bounded between 0 and 100. Some people might find this “statistically wrong” as the PSI distribution does not follow normality. My argument is that PSI is close enough to normality to allow for the use of a t-test and the test itself is pretty robust to non-normality anyway. This means that I believe I can use the t-test for this data; not that I should.
Calculate P values for SRRM2 KD in HeLa.
tidy_dPSI_HeLa |>
group_by(EVENT) |>
t_test(formula = PSI ~ PrettyGroup, ref.group = 'HeLa_CNTRL',
alternative = "two.sided", paired = F, p.adjust.method = 'none',
conf.level = 0.95, detailed = F) |>
adjust_pvalue(method = "BH") |> add_significance("p.adj") |>
select(EVENT, p, p.adj, p.adj.signif) |>
setNames(c('EVENT', 'p_SRRM2_HeLa', 'padj_SRRM2_HeLa',
'padj_signif_SRRM2_HeLa')) -> pvals_SRRM2_HeLaCombine all data frames into one. If an event as a P value that is NA, discard it.
left_join(x = tidy_dPSI_HeLa, y = pvals_SRRM2_HeLa, by = join_by(EVENT)) |>
subset(!is.na(p_SRRM2_HeLa)) |>
subset(!is.na(padj_SRRM2_HeLa)) -> tidy_dPSI_pval_HeLaCheck the P value distribution.
tidy_dPSI_pval_HeLa |>
select(EVENT, p_SRRM2_HeLa) |>
pivot_longer(cols = starts_with("p")) |>
ggplot() +
aes(x = value, fill = name) +
geom_histogram(bins = 25, colour = 'black', show.legend = F) +
labs(x = "P values") + theme_bw() Well that’s not that promising but at least it is somewhat uniform.
To control for this I set a non-corrected P value threshold to 0.01, which is more conservative, plus the size effect threshold is |∆PSI| >= 10.
dPSI_squish <- 40
tidy_dPSI_pval_HeLa <- tidy_dPSI_pval_HeLa |>
# label the events that are below significant thresholds
mutate(DSE_SRRM2_HeLa = p_SRRM2_HeLa < pval_thrshld & abs(dPSI_SRRM2) >= dPSI_thrshld ) Volcano plot of SRRM2 KD in HeLa. Points with |∆PSI| > 40 are squished on the sides of the plot.
# select fewer columns for plotting
subset(tidy_dPSI_pval_HeLa) |>
select(GENE, EVENT, dPSI_SRRM2, p_SRRM2_HeLa, padj_SRRM2_HeLa, DSE_SRRM2_HeLa) |>
unique() -> plot_HELA
ggplot(plot_HELA) +
aes(x = dPSI_SRRM2, y = -log10(p_SRRM2_HeLa), fill = DSE_SRRM2_HeLa) +
geom_hline(yintercept = -log10(pval_thrshld), linewidth = 0.1, colour = 'black') +
geom_vline(xintercept = -dPSI_thrshld, linewidth = 0.1, colour = 'black') +
geom_vline(xintercept = dPSI_thrshld, linewidth = 0.1, colour = 'black') +
geom_text_repel(data = subset(plot_HELA, DSE_SRRM2_HeLa),
aes(label = GENE, x = dPSI_SRRM2, y = -log10(p_SRRM2_HeLa)),
seed = 16, show.legend = F, segment.curvature = -1e-20,
family = "Arial", size = 2, nudge_x = -0.25,
segment.color = 'black', verbose = F, lwd = 0.1,
box.padding = grid::unit(1, "mm"),
point.padding = grid::unit(0.55, "mm"),
max.overlaps = 20 ) +
geom_point(shape = 21, show.legend = F, alpha = 1, stroke = 0.2) +
scale_fill_manual(values = c('gray84', 'firebrick3') ) +
scale_x_continuous(n.breaks = 10, oob = scales::squish, limits = c(-dPSI_squish, dPSI_squish),
expand = expansion(mult = 0, add = 0.5)) +
scale_y_continuous(expand = expansion(mult = c(0.01, 0.05), add = 0) ) +
coord_cartesian(ylim = c(0, 6.05)) + # set this to have the 2 volcano plots with the same Y-axis range
labs( x = "\u0394PSI (SRRM2 KD - Cntrl)", y = expression(-log[10] ~ "P value") ) +
theme_classic(base_family = 'Arial') +
theme(axis.text = element_text(colour = 'black'),
axis.line = element_line(colour = 'black', linewidth = 0.2),
panel.grid.major = element_line(linewidth = 0.1),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
plot.background = element_blank())ggsave(filename = '14_tidy_SRRM2_Volcano_HeLa.pdf', plot = last_plot(),
device = cairo_pdf, path = plot_dir, units = 'cm', width = 10,
height = 12)Check the PSI of some of them.
subset(tidy_dPSI_pval_HeLa) |>
select(GENE, EVENT, dPSI_SRRM2, p_SRRM2_HeLa, padj_SRRM2_HeLa, padj_signif_SRRM2_HeLa, DSE_SRRM2_HeLa) |>
subset(DSE_SRRM2_HeLa) |>
unique() |>
arrange(padj_SRRM2_HeLa) |>
head(10) |>
pull(EVENT) -> top_tidy_events_HeLaGrep the PSIs as not all of them are in the compared table. This my take a while…
grep_psi(inclusion_tbl = psi_path, vst_id = top_tidy_events_HeLa) |>
tidy_vst_psi() -> tmpPlot some PSIs.
left_join( x = tmp, y = mtdf, by = join_by(Sample)) |>
subset(CellType == 'HeLa' ) |>
plot_PSI_boxplot(num_col = 5, order_by_dPSI = F, legend_xy = c(1, 3))As one can see in the plot above, all these events have very good accordance between replicates (i.e. low standard deviation). So applying the Student’s t-test on PSI values with a low P values filter (0.01) is a good approach for detecting events that are clearly differentially spliced.
Select the differentially spliced events in SRRM2 KD HeLa.
subset(tidy_dPSI_pval_HeLa) |>
select(GENE, EVENT, dPSI_SRRM2, p_SRRM2_HeLa, padj_SRRM2_HeLa, DSE_SRRM2_HeLa) |>
subset(DSE_SRRM2_HeLa) |>
unique() -> tidy_signif_HeLa
up_tidy_HeLa <- subset(tidy_signif_HeLa, dPSI_SRRM2 >= 0)
do_tidy_HeLa <- subset(tidy_signif_HeLa, dPSI_SRRM2 < 0)
EX_up_tidy_HeLa <- grep(pattern = "^HsaEX", x = up_tidy_HeLa$EVENT, value = T)
EX_do_tidy_HeLa <- grep(pattern = "^HsaEX", x = do_tidy_HeLa$EVENT, value = T)
IN_up_tidy_HeLa <- grep(pattern = "^HsaINT", x = up_tidy_HeLa$EVENT, value = T)
IN_do_tidy_HeLa <- grep(pattern = "^HsaINT", x = do_tidy_HeLa$EVENT, value = T)
A3_up_tidy_HeLa <- grep(pattern = "^HsaALTA", x = up_tidy_HeLa$EVENT, value = T)
A3_do_tidy_HeLa <- grep(pattern = "^HsaALTA", x = do_tidy_HeLa$EVENT, value = T)
A3_tidy_HeLa <- match_alt35_id(up_ids = A3_up_tidy_HeLa, do_ids = A3_do_tidy_HeLa)
A5_up_tidy_HeLa <- grep(pattern = "^HsaALTD", x = up_tidy_HeLa$EVENT, value = T)
A5_do_tidy_HeLa <- grep(pattern = "^HsaALTD", x = do_tidy_HeLa$EVENT, value = T)
A5_tidy_HeLa <- match_alt35_id(up_ids = A5_up_tidy_HeLa, do_ids = A5_do_tidy_HeLa)Print the number of differentially spliced events found with the tidy approach.
message('HeLa SRRM2 KD tidy method:',
'\nUP Exon: ', length(EX_up_tidy_HeLa),
'\nDOWN Exon: ', length(EX_do_tidy_HeLa),
'\nUP Intron: ', length(IN_up_tidy_HeLa),
'\nDOWN Intron: ', length(IN_do_tidy_HeLa),
"\nDifferentially spliced 3' splice sites: ", length(A3_tidy_HeLa),
"\nDifferentially spliced 5' splice sites: ", length(A5_tidy_HeLa) )HeLa SRRM2 KD tidy method:
UP Exon: 17
DOWN Exon: 36
UP Intron: 1
DOWN Intron: 2
Differentially spliced 3' splice sites: 7
Differentially spliced 5' splice sites: 4
The tidy method identifies less events, as it is stringent.
Now repeat the same thing but for HepG2 cells
mrgd_cntrl <- "HepG2_CNTRL"
mrgd_SRRM2 <- "HepG2_SRRM2"
CNTRLKD <- c("HepG2_CNTRL_A", "HepG2_CNTRL_B", "HepG2_CNTRL_C")
SRRM2KD <- c("HepG2_SRRM2_Aa", "HepG2_SRRM2_Ab",
"HepG2_SRRM2_Ba", "HepG2_SRRM2_Bb",
"HepG2_SRRM2_Ca", "HepG2_SRRM2_Cb")Filter for samples that have finite ∆PSI (meaning that is not just NA in one of the mrgd samples) and that have maximum one sample per condition as NA (meaning at least 2 samples are not NA). This is step is done only on the SRRM2 KD in HeLa cells.
te_HepG2_SRRM2 <- tidy_events_HepG2[grepl(pattern = '(CNTRL|SRRM2)', x = tidy_events_HepG2$Sample, perl = T), ]Filter the data in the same way.
te_HepG2_SRRM2 |>
group_by(EVENT) |>
summarise(mPSI_CNTRL = mean(PSI[Sample %in% mrgd_cntrl], na.rm = T),
mPSI_SRRM2 = mean(PSI[Sample %in% mrgd_SRRM2], na.rm = T),
Samples_NA_CNTRL = sum(is.na(PSI[Sample %in% CNTRLKD])),
Samples_NA_SRRM2 = sum(is.na(PSI[Sample %in% SRRM2KD])) ) |>
# calculate ∆PSI
mutate(dPSI_SRRM2 = mPSI_SRRM2 - mPSI_CNTRL ) |>
subset( is.finite(dPSI_SRRM2) ) |>
# select samples with max 1 NA PSI value per event
subset( Samples_NA_CNTRL <= 1) |>
subset( Samples_NA_SRRM2 <= 1) |>
select(EVENT, dPSI_SRRM2) |>
unique() |>
# Add the ∆PSI of the filtered events to the original data frame
left_join(y = te_HepG2_SRRM2, by = join_by(EVENT)) |>
relocate(GENE, .before = EVENT) |>
relocate(starts_with('dPSI'), .after = PSI) |>
# add metadata information
left_join(y = mtdf, by = join_by(Sample)) |>
# remove merged super sample PSIs
subset( !Replicate == 'Merge') |>
group_by(EVENT) |>
# for the statistical test select only events that are changing a bit to enter the test
# if not the data is essentially constant and statistical test fails
# this is calculated by measuring the PSI difference standard deviation that should not be zero
mutate(dSD_SRRM2 = sd(PSI[Sample %in% CNTRLKD] - PSI[Sample %in% SRRM2KD],
na.rm = T) ) |>
subset(dSD_SRRM2 != 0) |>
select( !c(starts_with("dSD_"), PrettySample, KD, CellType)) |> # remove these columns
# In addition one could also select the events that have a ∆PSI different from zero.
# subset( abs(dPSI_SRRM2) >= 0.01 ) |>
ungroup() -> tidy_dPSI_HepG2
message(length(unique(tidy_dPSI_HepG2$EVENT)), '/',
length(unique(tidy_events$EVENT)), ' (',
100*signif(length(unique(tidy_dPSI_HepG2$EVENT))/length(unique(tidy_events$EVENT)), 2), '%) ',
'filtered events that enter the statistical analysis')13344/34490 (39%) filtered events that enter the statistical analysis
Calculate P values for SRRM2 KD in HepG2.
tidy_dPSI_HepG2 |>
group_by(EVENT) |>
t_test(formula = PSI ~ PrettyGroup, ref.group = 'HepG2_CNTRL', alternative = "two.sided",
paired = F, p.adjust.method = 'none', conf.level = 0.95, detailed = F) |>
adjust_pvalue(method = "BH") |> add_significance("p.adj") |>
select(EVENT, p, p.adj, p.adj.signif) |>
setNames(c('EVENT', 'p_SRRM2_HepG2', 'padj_SRRM2_HepG2', 'padj_signif_SRRM2_HepG2')) -> pvals_SRRM2_HepG2Combine all data frames into one. If an event as a P value that is NA, discard it.
left_join(x = tidy_dPSI_HepG2, y = pvals_SRRM2_HepG2, by = join_by(EVENT)) |>
subset(!is.na(p_SRRM2_HepG2)) |>
subset(!is.na(padj_SRRM2_HepG2)) -> tidy_dPSI_pval_HepG2Check the P value distribution in HepG2 cells.
tidy_dPSI_pval_HepG2 |>
select(EVENT, p_SRRM2_HepG2) |>
pivot_longer(cols = starts_with("p")) |>
ggplot() +
aes(x = value, fill = name) +
geom_histogram(bins = 25, colour = 'black', show.legend = F) +
labs(x = "P values") + theme_bw() dPSI_squish <- 40
tidy_dPSI_pval_HepG2 <- tidy_dPSI_pval_HepG2 |>
# label the events that are below significant thresholds
mutate(DSE_SRRM2_HepG2 = p_SRRM2_HepG2 < pval_thrshld & abs(dPSI_SRRM2) >= dPSI_thrshld ) Volcano plot of SRRM2 KD. Points with |∆PSI| > 40 are squished on the sides of the plot.
# select fewer columns for plotting
subset(tidy_dPSI_pval_HepG2) |>
select(GENE, EVENT, dPSI_SRRM2, p_SRRM2_HepG2, padj_SRRM2_HepG2, DSE_SRRM2_HepG2) |>
unique() -> plot_HepG2
ggplot(plot_HepG2) +
aes(x = dPSI_SRRM2, y = -log10(p_SRRM2_HepG2), fill = DSE_SRRM2_HepG2) +
geom_hline(yintercept = -log10(pval_thrshld), linewidth = 0.1, colour = 'black') +
geom_vline(xintercept = -dPSI_thrshld, linewidth = 0.1, colour = 'black') +
geom_vline(xintercept = dPSI_thrshld, linewidth = 0.1, colour = 'black') +
geom_text_repel(data = subset(plot_HepG2, DSE_SRRM2_HepG2),
aes(label = GENE, x = dPSI_SRRM2, y = -log10(p_SRRM2_HepG2)),
seed = 16, show.legend = F, segment.curvature = -1e-20,
family = "Arial", size = 2, nudge_x = -0.25,
segment.color = 'black', verbose = F, lwd = 0.1,
box.padding = grid::unit(1, "mm"),
point.padding = grid::unit(0.55, "mm"),
max.overlaps = 20 ) +
geom_point(shape = 21, show.legend = F, alpha = 1, stroke = 0.2) +
scale_fill_manual(values = c('gray84', 'firebrick3') ) +
scale_x_continuous(n.breaks = 10, oob = scales::squish, limits = c(-dPSI_squish, dPSI_squish),
expand = expansion(mult = 0, add = 0.5)) +
scale_y_continuous(expand = expansion(mult = c(0.01, 0.05), add = 0) ) +
coord_cartesian(ylim = c(0, 6.05)) + # set this to have the 2 volcano plots with the same Y-axis range
labs( x = "\u0394PSI (SRRM2 KD - Cntrl)", y = expression(-log[10] ~ "P value") ) +
theme_classic(base_family = 'Arial') +
theme(axis.text = element_text(colour = 'black'),
axis.line = element_line(colour = 'black', linewidth = 0.2),
panel.grid.major = element_line(linewidth = 0.1),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
plot.background = element_blank())ggsave(filename = '15_tidy_SRRM2_Volcano_HepG2.pdf', plot = last_plot(),
device = cairo_pdf, path = plot_dir, units = 'cm', width = 10,
height = 12)Select the differentially spliced events in SRRM2 KD HepG2 cells.
subset(tidy_dPSI_pval_HepG2) |>
select(GENE, EVENT, DSE_SRRM2_HepG2, dPSI_SRRM2, p_SRRM2_HepG2, padj_SRRM2_HepG2, DSE_SRRM2_HepG2) |>
subset(DSE_SRRM2_HepG2) |>
unique() -> tidy_signif_HepG2
up_tidy_HepG2 <- subset(tidy_signif_HepG2, dPSI_SRRM2 >= 0)
do_tidy_HepG2 <- subset(tidy_signif_HepG2, dPSI_SRRM2 < 0)
EX_up_tidy_HepG2 <- grep(pattern = "^HsaEX", x = up_tidy_HepG2$EVENT, value = T)
EX_do_tidy_HepG2 <- grep(pattern = "^HsaEX", x = do_tidy_HepG2$EVENT, value = T)
IN_up_tidy_HepG2 <- grep(pattern = "^HsaINT", x = up_tidy_HepG2$EVENT, value = T)
IN_do_tidy_HepG2 <- grep(pattern = "^HsaINT", x = do_tidy_HepG2$EVENT, value = T)
A3_up_tidy_HepG2 <- grep(pattern = "^HsaALTA", x = up_tidy_HepG2$EVENT, value = T)
A3_do_tidy_HepG2 <- grep(pattern = "^HsaALTA", x = do_tidy_HepG2$EVENT, value = T)
A3_tidy_HepG2 <- match_alt35_id(up_ids = A3_up_tidy_HepG2, do_ids = A3_do_tidy_HepG2)
A5_up_tidy_HepG2 <- grep(pattern = "^HsaALTD", x = up_tidy_HepG2$EVENT, value = T)
A5_do_tidy_HepG2 <- grep(pattern = "^HsaALTD", x = do_tidy_HepG2$EVENT, value = T)
A5_tidy_HepG2 <- match_alt35_id(up_ids = A5_up_tidy_HepG2, do_ids = A5_do_tidy_HepG2)Print the number of DSE found with the tidy method.
message('HepG2 SRRM2 KD tidy method:',
'\nUP Exon: ', length(EX_up_tidy_HepG2),
'\nDOWN Exon: ', length(EX_do_tidy_HepG2),
'\nUP Intron: ', length(IN_up_tidy_HepG2),
'\nDOWN Intron: ', length(IN_do_tidy_HepG2),
"\nDifferentially spliced 3' splice sites: ", length(A3_tidy_HepG2),
"\nDifferentially spliced 5' splice sites: ", length(A5_tidy_HepG2) )HepG2 SRRM2 KD tidy method:
UP Exon: 114
DOWN Exon: 38
UP Intron: 1
DOWN Intron: 114
Differentially spliced 3' splice sites: 20
Differentially spliced 5' splice sites: 10
Import the vast-tools diff tables. Note: diff outputs PSIs from 0 to 1, so I multiply them by 100.
This tables were create with this command line short cut:
tail -n +2 Diff_output_all_events_noB3_pIR.tab | \
awk '{OFS="\t"} { if($6 >= 0.15) print $1, $2, $5, $6}' | \
sort -k3,4nr | sed -e $'1iGENE\tEVENT\tdPSI\tMVdPSIat95' > FltrdWhich filters for events with a minimum value of |∆PSI| >= 15 with a 95% confidence (column number 6) and sorts them by observed ∆PSI (column number 5).
diff_HeLa <- read_vst_tbl(path = diff_HeLa_SRRM2_path, show_col_types = FALSE) |>
mutate(dPSI = dPSI * 100) |>
mutate(MVdPSIat95 = MVdPSIat95 * 100) |>
mutate(COMPLEX = case_when( grepl(x = EVENT, pattern = '^HsaEX') ~ 'S',
grepl(x = EVENT, pattern = '^HsaINT') ~ 'IR',
grepl(x = EVENT, pattern = '^HsaALTD') ~ 'Alt5',
grepl(x = EVENT, pattern = '^HsaALTA') ~ 'Alt3') ) |>
mutate(comparison = 'HeLa')
diff_HepG2 <- read_vst_tbl(path = diff_HepG2_SRRM2_path, show_col_types = FALSE) |>
mutate(dPSI = dPSI * 100) |>
mutate(MVdPSIat95 = MVdPSIat95 * 100) |>
mutate(COMPLEX = case_when( grepl(x = EVENT, pattern = '^HsaEX') ~ 'S',
grepl(x = EVENT, pattern = '^HsaINT') ~ 'IR',
grepl(x = EVENT, pattern = '^HsaALTD') ~ 'Alt5',
grepl(x = EVENT, pattern = '^HsaALTA') ~ 'Alt3') ) |>
mutate(comparison = 'HepG2')How many AS events?
diff_events_HeLa <- unique(diff_HeLa$EVENT)
message(length(diff_events_HeLa), ' diff events in HeLa upon SRRM2 KD')188 diff events in HeLa upon SRRM2 KD
diff_events_HepG2 <- unique(diff_HepG2$EVENT)
message(length(diff_events_HepG2), ' diff events in HepG2 upon SRRM2 KD')924 diff events in HepG2 upon SRRM2 KD
diff_events_common <- intersect(diff_events_HeLa, diff_events_HepG2)
message(length(diff_events_common), ' diff events common between both KD')3 diff events common between both KD
# make a list with the DSE of each KD
diff_events_HeLa_only <- diff_events_HeLa[( !diff_events_HeLa %in% diff_events_common)]
diff_events_HepG2_only <- diff_events_HepG2[( !diff_events_HepG2 %in% diff_events_common)]Interestingly not many events are common between the 2 KD.
Plot the actual ∆PSI of the events that were filtered to have a minimum value of |∆PSI| >=15 with a 95% confidence.
plot_dPSI_hist_grpd(diff_HeLa) +
geom_vline(xintercept = -15) + geom_vline(xintercept = 15) ggsave(filename = '16_diff_dPSI_SRRM2-KD_HeLa_Histogram.pdf', plot = last_plot(),
device = cairo_pdf, path = plot_dir, units = 'cm', width = 12,
height = 6)There seem to be a good balance between up and down regulated events.
plot_dPSI_hist_grpd(diff_HepG2) +
geom_vline(xintercept = -15) + geom_vline(xintercept = 15) ggsave(filename = '17_diff_dPSI_SRRM2-KD_HepG2_Histogram.pdf', plot = last_plot(),
device = cairo_pdf, path = plot_dir, units = 'cm', width = 12,
height = 6)These seems to be more down regulated AS events in HepG2 SRRM2 KD.
The diff table does not contain all the information about the AS event as the main full inclusion table so I get such info from the full inclusion table.
fetch_diff_IDs <- unique(c(diff_events_HeLa, diff_events_HepG2))
message(length(fetch_diff_IDs), ' events to get info... this may take some time')1109 events to get info... this may take some time
This next step might take some time to process.
grep_psi(inclusion_tbl = psi_path, vst_id = fetch_diff_IDs) |>
tidy_vst_psi() -> diff_psi_dfTo avoid having to repeat this step in the future I save this dataframe to file and not execute the code chunk above any more.
write_delim(x = diff_psi_df, delim = '\t',
file = file.path(diff_dir, 'Fltrd_Annotated_Diff_Events_HeLa_Both_OE.tab') ) Import the table that was written the first time.
diff_psi_df <- read_vst_tbl(show_col_types = FALSE,
path = file.path(diff_dir, 'Fltrd_Annotated_Diff_Events_KD.tab') ) |>
left_join(y = mtdf, by = join_by(Sample))Separate between the exons and the introns that are up or down spliced in HeLa SRRM2 KD
up_diff_HeLa <- subset(diff_HeLa, dPSI >= 0)
do_diff_HeLa <- subset(diff_HeLa, dPSI < 0)
EX_up_diff_HeLa <- grep(pattern = "^HsaEX", x = up_diff_HeLa$EVENT, value = T)
EX_do_diff_HeLa <- grep(pattern = "^HsaEX", x = do_diff_HeLa$EVENT, value = T)
IN_up_diff_HeLa <- grep(pattern = "^HsaINT", x = up_diff_HeLa$EVENT, value = T)
IN_do_diff_HeLa <- grep(pattern = "^HsaINT", x = do_diff_HeLa$EVENT, value = T)
A3_up_diff_HeLa <- grep(pattern = "^HsaALTA", x = up_diff_HeLa$EVENT, value = T)
A3_do_diff_HeLa <- grep(pattern = "^HsaALTA", x = do_diff_HeLa$EVENT, value = T)
A3_diff_HeLa <- match_alt35_id(up_ids = A3_up_tidy_HeLa, do_ids = A3_do_tidy_HeLa)
A5_up_diff_HeLa <- grep(pattern = "^HsaALTD", x = up_diff_HeLa$EVENT, value = T)
A5_do_diff_HeLa <- grep(pattern = "^HsaALTD", x = do_diff_HeLa$EVENT, value = T)
A5_diff_HeLa <- match_alt35_id(up_ids = A5_up_diff_HeLa, do_ids = A5_do_diff_HeLa)Print the number.
message('HeLa SRRM2 KD diff method:',
'\nUP Exon: ', length(EX_up_diff_HeLa),
'\nDOWN Exon: ', length(EX_do_diff_HeLa),
'\nUP Intron: ', length(IN_up_diff_HeLa),
'\nDOWN Intron: ', length(IN_do_diff_HeLa),
"\nDifferentially spliced 3' splice sites: ", length(A3_diff_HeLa),
"\nDifferentially spliced 5' splice sites: ", length(A5_diff_HeLa) )HeLa SRRM2 KD diff method:
UP Exon: 39
DOWN Exon: 33
UP Intron: 42
DOWN Intron: 26
Differentially spliced 3' splice sites: 7
Differentially spliced 5' splice sites: 25
Separate between the exons and the introns that are up or down spliced in HepG2 SRRM2 KD
up_diff_HepG2 <- subset(diff_HepG2, dPSI >= 0)
do_diff_HepG2 <- subset(diff_HepG2, dPSI < 0)
EX_up_diff_HepG2 <- grep(pattern = "^HsaEX", x = up_diff_HepG2$EVENT, value = T)
EX_do_diff_HepG2 <- grep(pattern = "^HsaEX", x = do_diff_HepG2$EVENT, value = T)
IN_up_diff_HepG2 <- grep(pattern = "^HsaINT", x = up_diff_HepG2$EVENT, value = T)
IN_do_diff_HepG2 <- grep(pattern = "^HsaINT", x = do_diff_HepG2$EVENT, value = T)
A3_up_diff_HepG2 <- grep(pattern = "^HsaALTA", x = do_diff_HepG2$EVENT, value = T)
A3_do_diff_HepG2 <- grep(pattern = "^HsaALTA", x = do_diff_HepG2$EVENT, value = T)
A3_diff_HepG2 <- match_alt35_id(up_ids = A3_up_diff_HepG2, do_ids = A3_do_diff_HepG2)
A5_up_diff_HepG2 <- grep(pattern = "^HsaALTD", x = do_diff_HepG2$EVENT, value = T)
A5_do_diff_HepG2 <- grep(pattern = "^HsaALTD", x = do_diff_HepG2$EVENT, value = T)
A5_diff_HepG2 <- match_alt35_id(up_ids = A5_up_diff_HepG2, do_ids = A5_do_diff_HepG2)Print the number of DSE found with the diff method.
message('HepG2 SRRM2 KD diff method:',
'\nUP Exon: ', length(EX_up_diff_HepG2),
'\nDOWN Exon: ', length(EX_do_diff_HepG2),
'\nUP Intron: ', length(IN_up_diff_HepG2),
'\nDOWN Intron: ', length(IN_do_diff_HepG2),
"\nDifferentially spliced 3' splice sites: ", length(A3_diff_HepG2),
"\nDifferentially spliced 5' splice sites: ", length(A5_diff_HepG2))HepG2 SRRM2 KD diff method:
UP Exon: 64
DOWN Exon: 133
UP Intron: 63
DOWN Intron: 408
Differentially spliced 3' splice sites: 89
Differentially spliced 5' splice sites: 90
Checking the SRRM2 expression across the dataset.
grep_gene_expression(vst_expression_tbl = expr_path,
ensembl_gene_id = 'ENSG00000167978') |>
tidy_vst_expr(expression_unit = 'TPM' ) |>
left_join(y = mtdf, by = join_by(Sample) ) |>
ggplot() +
aes(x = PrettyGroup, y = log2(Gene_Expr)) +
geom_boxplot(fill = NA, outlier.shape = NA, lwd = 0.2) +
geom_point( aes(shape = Replicate, fill = CellType), show.legend = FALSE,
size = 1.5, stroke = 0.2, position = position_dodge(width = 0.5)) +
scale_shape_manual(values = c('Merge' = 22, 'A' = 21, 'Aa' = 21, 'Ab' = 21,
'B' = 21, 'Ba' = 21, 'Bb' = 21,
'C' = 21, 'Ca' = 21, 'Cb' = 21 ) ) +
scale_x_discrete(labels = \(x) gsub(pattern = '_', replacement = '\n', x) ) +
scale_fill_manual(values = sample_palette) +
coord_cartesian(ylim = c(5, 9)) +
labs(y = 'Norm. TPMs (log2)') +
guides(fill = guide_legend(override.aes = list(shape = 21)), shape = 'none') +
theme_classic() +
theme(axis.text = element_text(colour = "black"),
axis.line = element_line(linewidth = 0.2),
axis.title.x = element_blank(),
strip.background = element_blank(),
strip.text = element_text(vjust = 1, lineheight = 0,
margin = margin(t = 0, unit = 'mm')),
panel.grid.major = element_line(colour = 'gray84', linewidth = 0.15) ) ggsave(filename = '0_SRRM2_Expression_HeLa_HepG2_Boxplot.pdf', plot = last_plot(),
device = cairo_pdf, path = plot_dir, units = 'cm', width = 16,
height = 8)The square point is the merged replicates into a super sample. The KD seems to work.
Plots should be improved.
EX_up_HeLa <- unique(c(EX_up_cmpr_HeLa, EX_up_tidy_HeLa, EX_up_diff_HeLa))
EX_do_HeLa <- unique(c(EX_do_cmpr_HeLa, EX_do_tidy_HeLa, EX_do_diff_HeLa))
EX_up_HepG2 <- unique(c(EX_up_cmpr_HepG2, EX_up_tidy_HepG2, EX_up_diff_HepG2))
EX_do_HepG2 <- unique(c(EX_do_cmpr_HepG2, EX_do_tidy_HepG2, EX_do_diff_HepG2))
dsexons_list <- list(EX_up_HeLa, EX_do_HeLa, EX_up_HepG2, EX_do_HepG2)
names(dsexons_list) <- c('EX UP HeLa', 'EX DOWN HeLa', 'EX UP HepG2', 'EX DOWN HepG2')intersection_exons_ids <- upset(fromList(dsexons_list), text.scale = 1.5,
nintersects = 6,
mb.ratio = c(0.7, 0.3),
point.size = 5, order.by = 'freq', keep.order = T,
sets = rev(names(dsexons_list)),
main.bar.color = "gray16") Show the number and overlap of events found by each method with an upset plot:
intersection_exons_idsSave to pdf.
cairo_pdf(file = file.path(plot_dir, '19_overlap_diff_exons_Upset_HeLa_HepG2.pdf'),
width = 5, height = 4.5, bg = NA)
intersection_exons_ids
dev.off()png
2
IN_up_HeLa <- unique(c(IN_up_cmpr_HeLa, IN_up_tidy_HeLa, IN_up_diff_HeLa))
IN_do_HeLa <- unique(c(IN_do_cmpr_HeLa, IN_do_tidy_HeLa, IN_do_diff_HeLa))
IN_up_HepG2 <- unique(c(IN_up_cmpr_HepG2, IN_up_tidy_HepG2, IN_up_diff_HepG2))
IN_do_HepG2 <- unique(c(IN_do_cmpr_HepG2, IN_do_tidy_HepG2, IN_do_diff_HepG2))
dsintrons_list <- list(IN_up_HeLa, IN_do_HeLa, IN_up_HepG2, IN_do_HepG2)
names(dsintrons_list) <- c('INT UP HeLa', 'INT DOWN HeLa', 'INT UP HepG2', 'INT DOWN HepG2')Make upset plot introns
intersection_introns_ids <- upset(fromList(dsintrons_list), text.scale = 1.5, nintersects = NA,
mb.ratio = c(0.7, 0.3),
point.size = 5, order.by = 'freq', keep.order = T,
sets = rev(names(dsintrons_list)),
main.bar.color = "gray16") Show the number and overlap of events found by each method with an upset plot:
intersection_introns_idscairo_pdf(file = file.path(plot_dir, '20_overlap_diff_introns_Upset.pdf'),
width = 6, height = 4.5, bg = NA)
intersection_introns_ids
dev.off()png
2
There is very little overlap between the events that are differentially spliced in HeLa or HepG2 cells upon SRRM2 knockdown!
A3_HeLa <- unique(c(A3_cmpr_HeLa, A3_tidy_HeLa, A3_diff_HeLa))
A5_HeLa <- unique(c(A5_cmpr_HeLa, A5_tidy_HeLa, A5_diff_HeLa))
A3_HepG2 <- unique(c(A3_cmpr_HepG2, A3_tidy_HepG2, A3_diff_HepG2))
A5_HepG2 <- unique(c(A5_cmpr_HepG2, A5_tidy_HepG2, A5_diff_HepG2))
dsaltss_list <- list(A3_HeLa, A3_HepG2, A5_HeLa, A5_HepG2)
names(dsaltss_list) <- c('ALT 3SS HeLa', 'ALT 3SS HepG2', 'ALT 5SS HeLa', 'ALT 5SS HepG2')Make upset plot alternative splice sites
intersection_ss_ids <- upset(fromList(dsaltss_list), text.scale = 1.5, nintersects = NA,
mb.ratio = c(0.7, 0.3),
point.size = 5, order.by = 'freq', keep.order = T,
sets = rev(names(dsaltss_list)),
main.bar.color = "gray16") Show the number and overlap of events found by each method with an upset plot:
intersection_ss_idscairo_pdf(file = file.path(plot_dir, '21_overlap_alt_splice_sites_Upset.pdf'),
width = 6.1, height = 5, bg = NA)
intersection_introns_ids
dev.off()png
2
Plot how many differentially Spliced Events by method? This code doesn’t separate between exons or introns but only between up or down-regulated events.
Check the number of DSE found by each method.
dse_list_HeLa <- list(up_cmpr_HeLa$EVENT, do_cmpr_HeLa$EVENT,
up_tidy_HeLa$EVENT, do_tidy_HeLa$EVENT,
up_diff_HeLa$EVENT, do_diff_HeLa$EVENT )
names(dse_list_HeLa) <- c('UP HeLa CMPR', 'DOWN HeLa CMPR',
'UP HeLa TIDY', 'DOWN HeLa TIDY',
'UP HeLa DIFF', 'DOWN HeLa DIFF' )
dse_list_HepG2 <- list(up_cmpr_HepG2$EVENT, do_cmpr_HepG2$EVENT,
up_tidy_HepG2$EVENT, do_tidy_HepG2$EVENT,
up_diff_HepG2$EVENT, do_diff_HepG2$EVENT)
names(dse_list_HepG2) <- c('UP HepG2 CMPR', 'DOWN HepG2 CMPR',
'UP HepG2 TIDY', 'DOWN HepG2 TIDY',
'UP HepG2 DIFF', 'DOWN HepG2 DIFF')intersection_methods_HeLa <- upset(fromList(dse_list_HeLa), nsets = length(dse_list_HeLa),
text.scale = 1.5, nintersects = NA,
mb.ratio = c(0.6, 0.4), decreasing = c(T, F),
order.by = 'freq', main.bar.color = "gray16") Show the number and overlap of events found by each method with an upset plot:
intersection_methods_HeLacairo_pdf(file = file.path(plot_dir, '22_overlap_HeLa_methods_ex_introns_Upset.pdf'),
width = 6, height = 4, bg = NA)
intersection_methods_HeLa
dev.off()png
2
Extract the HeLa bona fide (found by 3 different methods) UP and DOWN events
dse_list_HeLa_UP <- list(up_cmpr_HeLa$EVENT, up_tidy_HeLa$EVENT,
up_diff_HeLa$EVENT)
# find events that are more included in all 3 methods comparisons
BF_events_HeLa_UP <- Reduce(intersect, lapply(dse_list_HeLa_UP, function(x) x ) )
message("Num. events diff. upregulated in all 3 methods in HeLa: ", length(BF_events_HeLa_UP),
"\nEvent: ", BF_events_HeLa_UP)Num. events diff. upregulated in all 3 methods in HeLa: 1
Event: HsaEX0065944
dse_list_HeLa_DO <- list(do_cmpr_HeLa$EVENT, do_tidy_HeLa$EVENT, do_diff_HeLa$EVENT)
# find events that are more skipped in all 3 methods comparisons
BF_events_HeLa_DO <- Reduce(intersect, lapply(dse_list_HeLa_DO, function(x) x ) )
message("Num. events diff. downregulated in all 3 methods in HeLa: ", length(BF_events_HeLa_DO),
"\nEvents: ", paste0(BF_events_HeLa_DO, collapse = ", ") )Num. events diff. downregulated in all 3 methods in HeLa: 6
Events: HsaEX0035105, HsaEX0066187, HsaEX0062599, HsaEX0026508, HsaEX0026509, HsaEX0035021
Now for HepG2
intersection_methods_HepG2 <- upset(fromList(dse_list_HepG2), nsets = length(dse_list_HepG2),
text.scale = 1.5, nintersects = NA,
mb.ratio = c(0.6, 0.4), decreasing = c(T, F),
order.by = 'freq', main.bar.color = "gray16") Show the number and overlap of events found by each method with an upset plot:
intersection_methods_HepG2Save to pdf the HepG2 plot.
cairo_pdf(file = file.path(plot_dir, '22_overlap_HepG2_methods_ex_introns_Upset.pdf'),
width = 6, height = 4, bg = NA)
intersection_methods_HepG2
dev.off()png
2
Extract the HepG2 bona fide (found by 3 different methods) UP and DOWN events
dse_list_HepG2_UP <- list(up_cmpr_HepG2$EVENT, up_tidy_HepG2$EVENT,up_diff_HepG2$EVENT)
# find events that are more included in all 3 methods comparisons
BF_events_HepG2_UP <- Reduce(intersect, lapply(dse_list_HepG2_UP, function(x) x ) )
message("Num. events diff. upregulated in all 3 methods in HepG2: ", length(BF_events_HepG2_UP),
"\nEvent: ", paste0(BF_events_HepG2_UP, collapse = ", ") )Num. events diff. upregulated in all 3 methods in HepG2: 5
Event: HsaEX0041240, HsaEX0066371, HsaEX0037169, HsaEX0028178, HsaEX0028282
dse_list_HepG2_DO <- list(do_cmpr_HepG2$EVENT, do_tidy_HepG2$EVENT, do_diff_HepG2$EVENT)
# find events that are more skipped in all 3 methods comparisons
BF_events_HepG2_DO <- Reduce(intersect, lapply(dse_list_HepG2_DO, function(x) x ) )
message("Num. events diff. downregulated in all 3 methods in HepG2: ", length(dse_list_HepG2_DO) )Num. events diff. downregulated in all 3 methods in HepG2: 3
Select the best bona fide events that are up or down regulated in HeLa and HepG2 cells.
Best_SRRMS_events_IDs <- c(BF_events_HeLa_UP, BF_events_HeLa_DO, BF_events_HepG2_UP, BF_events_HepG2_DO) |> unique()
length(Best_SRRMS_events_IDs)[1] 70
Best_SRRMS_events_IDs_df <- data.frame(DSE_robust_HeLa_HepG2 = sort(Best_SRRMS_events_IDs))Save these exons and introns to file in case someone will need them in the future
write_delim(x = Best_SRRMS_events_IDs_df, col_names = T, append = F,
quote = 'none', escape = 'none',
file = file.path(dse_IDs_dir, 'Bona_Fide_Diff_Spliced_Exons_Introns_SRRM2-KD_HeLa_HepG2.tab'),
delim = '\t')In conclusion when analysing SRRM2 knockdown RNA-seq in HeLa and HepG2 from 2 different publications and using 3 different filtering methods to call for differentially spliced exons and introns, I found that only 70 exons and introns are reliably detected in both experiments in all 3 analysis methods.
Export the significant events and look for features using Matt.
Calculate the intersection of up or down regulated exons between HeLa and HepG2 cells as well as the uniquely mis-spliced exons in each cell line.
EX_up_common_HH <- intersect( EX_up_HeLa, EX_up_HepG2 )
EX_do_common_HH <- intersect( EX_up_HepG2, EX_do_HepG2 )
EX_up_uniq_HeLa <- EX_up_HeLa[ !(EX_up_HeLa %in% c(EX_up_common_HH, EX_up_HepG2) )]
EX_do_uniq_HeLa <- EX_do_HeLa[ !(EX_do_HeLa %in% c(EX_do_common_HH, EX_do_HepG2) )]
EX_up_uniq_HepG2 <- EX_up_HepG2[ !(EX_up_HepG2 %in% c(EX_up_common_HH, EX_up_HeLa) )]
EX_do_uniq_HepG2 <- EX_do_HepG2[ !(EX_do_HepG2 %in% c(EX_do_common_HH, EX_up_HeLa) )] Export to file the differentially spliced exons IDs
write_delim(x = as.data.frame(EX_up_common_HH), col_names = F, append = F,
quote = 'none', escape = 'none',
file = file.path(dse_IDs_dir, '1_UP_EXONS_BOTH.tab'), delim = '\t')
write_delim(x = as.data.frame(EX_do_common_HH), col_names = F, append = F,
quote = 'none', escape = 'none',
file = file.path(dse_IDs_dir, '2_DOWN_EXONS_BOTH.tab'), delim = '\t')
write_delim(x = as.data.frame(EX_up_uniq_HeLa), col_names = F, append = F,
quote = 'none', escape = 'none',
file = file.path(dse_IDs_dir, '3_UP_EXONS_HeLa.tab'), delim = '\t')
write_delim(x = as.data.frame(EX_do_uniq_HeLa), col_names = F, append = F,
quote = 'none', escape = 'none',
file = file.path(dse_IDs_dir, '4_DOWN_EXONS_HeLa.tab'), delim = '\t')
write_delim(x = as.data.frame(EX_up_uniq_HepG2), col_names = F, append = F,
quote = 'none', escape = 'none',
file = file.path(dse_IDs_dir, '5_UP_EXONS_HepG2.tab'), delim = '\t')
write_delim(x = as.data.frame(EX_do_uniq_HepG2), col_names = F, append = F,
quote = 'none', escape = 'none',
file = file.path(dse_IDs_dir, '6_DOWN_EXONS_HepG2.tab'), delim = '\t')Calculate the intersection of up or down regulated introns between HeLa and HepG2 cells as well as the uniquely mis-spliced introns in each cell line.
IN_up_common_HH <- intersect(IN_up_HeLa, IN_up_HepG2)
IN_do_common_HH <- intersect(IN_do_HeLa, IN_do_HepG2)
IN_up_uniq_HeLa <- IN_up_HeLa[ !(IN_up_HeLa %in% c(IN_up_common_HH, IN_up_HepG2)) ]
IN_do_uniq_HeLa <- IN_do_HeLa[ !(IN_do_HeLa %in% c(IN_do_common_HH, IN_do_HepG2)) ]
IN_up_uniq_HepG2 <- IN_up_HepG2[ !(IN_up_HepG2 %in% c(IN_up_common_HH, IN_up_HeLa)) ]
IN_do_uniq_HepG2 <- IN_do_HepG2[ !(IN_do_HepG2 %in% c(IN_do_common_HH, IN_do_HeLa)) ] Export to file the differentially spliced introns IDs
write_delim(x = as.data.frame(IN_up_common_HH), col_names = F, append = F,
quote = 'none', escape = 'none', delim = '\t',
file = file.path(dse_IDs_dir, '1_UP_INTRONS_BOTH.tab') )
write_delim(x = as.data.frame(IN_do_common_HH), col_names = F, append = F,
quote = 'none', escape = 'none', delim = '\t',
file = file.path(dse_IDs_dir, '2_DOWN_INTRONS_BOTH.tab') )
write_delim(x = as.data.frame(IN_up_uniq_HeLa), col_names = F, append = F,
quote = 'none', escape = 'none', delim = '\t',
file = file.path(dse_IDs_dir, '3_UP_INTRONS_HeLa.tab') )
write_delim(x = as.data.frame(IN_do_uniq_HeLa), col_names = F, append = F,
quote = 'none', escape = 'none', delim = '\t',
file = file.path(dse_IDs_dir, '4_DOWN_INTRONS_HeLa.tab') )
write_delim(x = as.data.frame(IN_up_uniq_HepG2), col_names = F, append = F,
quote = 'none', escape = 'none', delim = '\t',
file = file.path(dse_IDs_dir, '5_UP_INTRONS_HepG2.tab') )
write_delim(x = as.data.frame(IN_do_uniq_HepG2), col_names = F, append = F,
quote = 'none', escape = 'none', delim = '\t',
file = file.path(dse_IDs_dir, '6_DOWN_INTRONS_HepG2.tab'))A3_common <- A3_HeLa[(A3_HeLa %in% A3_HepG2)]
A5_common <- A5_HeLa[(A5_HeLa %in% A5_HepG2)]
A3_uniqHeLa <- A3_HeLa[!A3_HeLa %in% A3_HepG2]
stopifnot(length(A3_uniqHeLa) + length(A3_common) == length(A3_HeLa))
A3_uniqHepG2 <- A3_HepG2[!(A3_HepG2 %in% A3_HeLa)]
stopifnot(length(A3_uniqHepG2) + length(A3_common) == length(A3_HepG2))
A5_uniqHeLa <- A5_HeLa[!A5_HeLa %in% A5_HepG2]
stopifnot(length(A5_uniqHeLa) + length(A5_common) == length(A5_HeLa))
A5_uniqHepG2 <- A5_HepG2[!(A5_HepG2 %in% A5_HeLa)]
stopifnot(length(A5_uniqHepG2) + length(A5_common) == length(A5_HepG2))Export to file the differentially spliced splice sites IDs
write_delim(x = as.data.frame(A3_common), col_names = F, append = F, quote = 'none', escape = 'none',
file = file.path(dse_IDs_dir, '1_ALT_ACCEPTOR_SPLICE_SITES_BOTH.tab'), delim = '\t')
write_delim(x = as.data.frame(A5_common), col_names = F, append = F, quote = 'none', escape = 'none',
file = file.path(dse_IDs_dir, '2_ALT_DONOR_SPLICE_SITES_BOTH.tab'), delim = '\t')
write_delim(x = as.data.frame(A3_uniqHeLa), col_names = F, append = F, quote = 'none', escape = 'none',
file = file.path(dse_IDs_dir, '3_ALT_ACCEPTOR_SPLICE_SITES_HeLa.tab'), delim = '\t')
write_delim(x = as.data.frame(A3_uniqHepG2), col_names = F, append = F, quote = 'none', escape = 'none',
file = file.path(dse_IDs_dir, '4_ALT_ACCEPTOR_SPLICE_SITES_HepG2.tab'), delim = '\t')
write_delim(x = as.data.frame(A5_uniqHeLa), col_names = F, append = F, quote = 'none', escape = 'none',
file = file.path(dse_IDs_dir, '5_ALT_DONOR_SPLICE_SITES_HeLa.tab'), delim = '\t')
write_delim(x = as.data.frame(A5_uniqHepG2), col_names = F, append = F, quote = 'none', escape = 'none',
file = file.path(dse_IDs_dir, '6_ALT_DONOR_SPLICE_SITES_HepG2.tab'), delim = '\t')rbind(
data.frame(EXPERIMENT = 'HeLa-SRRM2-KD',
TYPE = "Exon",
EVENT = c(EX_up_common_HH, EX_up_uniq_HeLa),
DIRECTION = 'UP'),
data.frame(EXPERIMENT = 'HepG2-SRRM2-KD',
TYPE = "Exon",
EVENT = c(EX_up_common_HH, EX_up_uniq_HepG2),
DIRECTION = 'UP'),
data.frame(EXPERIMENT = 'HeLa-SRRM2-KD',
TYPE = "Exon",
EVENT = c(EX_do_common_HH, EX_do_uniq_HeLa),
DIRECTION = 'DOWN'),
data.frame(EXPERIMENT = 'HepG2-SRRM2-KD',
TYPE = "Exon",
EVENT = c(EX_do_common_HH, EX_do_uniq_HepG2),
DIRECTION = 'DOWN')
) -> exons_IDs
rbind(
data.frame(EXPERIMENT = 'HeLa-SRRM2-KD',
TYPE = "Intron",
EVENT = c(IN_up_common_HH, IN_up_uniq_HeLa),
DIRECTION = 'UP'),
data.frame(EXPERIMENT = 'HepG2-SRRM2-KD',
TYPE = "Intron",
EVENT = c(IN_up_common_HH, IN_up_uniq_HepG2),
DIRECTION = 'UP'),
data.frame(EXPERIMENT = 'HeLa-SRRM2-KD',
TYPE = "Intron",
EVENT = c(IN_do_common_HH, IN_do_uniq_HeLa),
DIRECTION = 'DOWN'),
data.frame(EXPERIMENT = 'HepG2-SRRM2-KD',
TYPE = "Intron",
EVENT = c(IN_do_common_HH, IN_do_uniq_HepG2),
DIRECTION = 'DOWN')
) -> introns_IDs
rbind(
data.frame(EXPERIMENT = 'HeLa-SRRM2-KD',
TYPE = "Alt. 3' ss",
EVENT = c(A3_common, A3_uniqHeLa)),
data.frame(EXPERIMENT = 'HepG2-SRRM2-KD',
TYPE = "Alt. 3' ss",
EVENT = c(A3_common, A3_uniqHepG2)),
data.frame(EXPERIMENT = 'HeLa-SRRM2-KD',
TYPE = "Alt. 5' ss",
EVENT = c(A5_common, A5_uniqHeLa)),
data.frame(EXPERIMENT = 'HepG2-SRRM2-KD',
TYPE = "Alt. 5' ss",
EVENT = c(A5_common, A5_uniqHepG2))
) -> alt_splice_sites_IDs
alt_splice_sites_IDs$DIRECTION = 'BOTH'
shared_and_uniq_AS_events <- rbind(exons_IDs, introns_IDs, alt_splice_sites_IDs)Save to file
write_delim(x = shared_and_uniq_AS_events, col_names = T, append = F, quote = 'none', escape = 'none',
file = file.path(dse_IDs_dir, '0_ALL_SHARED_UNIQUE_ALT_SPLICED_EVENTS.tab'), delim = '\t')SHR_UNI_AS_events <- read_delim(file = file.path(dse_IDs_dir, '0_ALL_SHARED_UNIQUE_ALT_SPLICED_EVENTS.tab'),
delim = '\t', quote = 'none',
col_names = T, show_col_types = FALSE)Annotate these events by fetching the PSI and coordinates from the main inclusion table.
fetch_diff_IDs <- unique(SHR_UNI_AS_events$EVENT)
message(length(fetch_diff_IDs), ' events to get info... this may take some time')4071 events to get info... this may take some time
This next step might take some time to process.
grep_psi(inclusion_tbl = psi_path, vst_id = fetch_diff_IDs) |>
tidy_vst_psi() -> shr_uni_psi_dfMerge all the differentially spliced events with the annotation data. 3.
DSE_tbl <- right_join(x = SHR_UNI_AS_events, y = shr_uni_psi_df, by = join_by(EVENT), multiple = "all")Warning in right_join(x = SHR_UNI_AS_events, y = shr_uni_psi_df, by = join_by(EVENT), : Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 32267 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
Export the annotated differentially spliced events.
stopifnot( length(unique(SHR_UNI_AS_events$EVENT)) == length(unique(DSE_tbl$EVENT)) )
write_delim(x = DSE_tbl, col_names = T, append = F, quote = 'none', escape = 'none',
file = file.path(dse_IDs_dir, '0_ALL_SHARED_UNIQUE_ALT_SPLICED_EVENTS_PSI.tab'),
delim = '\t')Import this table instead of processing it every time.
DSE_tbl <- read_delim(file = file.path(dse_IDs_dir, '0_ALL_SHARED_UNIQUE_ALT_SPLICED_EVENTS_PSI.tab'),
delim = '\t', quote = 'none',
col_names = T, show_col_types = FALSE)Filter out all the samples not needed in this analysisi.
DSE_tbl_HeLa <- subset(DSE_tbl, EXPERIMENT == "HeLa-SRRM2-KD") |>
subset(!grepl(Sample, pattern = '^HepG2')) |>
subset(grepl(Sample, pattern = 'CNTRL|SRRM2'))
DSE_tbl_HepG2 <- subset(DSE_tbl, EXPERIMENT == "HepG2-SRRM2-KD") |>
subset(!grepl(Sample, pattern = '^HeLa')) |>
subset(grepl(Sample, pattern = 'CNTRL|SRRM2'))Define samples and supersample groups in the HeLa dataset
mrgd_cntrl <- "HeLa_CNTRL"
mrgd_SRRM2 <- "HeLa_SRRM2"
CNTRLKD <- c("HeLa_CNTRL_A", "HeLa_CNTRL_B", "HeLa_CNTRL_C")
SRRM2KD <- c("HeLa_SRRM2_A", "HeLa_SRRM2_B", "HeLa_SRRM2_C")Calculate ∆PSIs for TAF dataset.
dPSI_thshld <- 15
DSE_tbl_HeLa |>
group_by(EVENT) |>
summarise(mPSI_CNTRL = mean(PSI[Sample %in% mrgd_cntrl], na.rm = T),
mPSI_SRRM2 = mean(PSI[Sample %in% mrgd_SRRM2], na.rm = T) ) |>
mutate(dPSI_SRRM2 = mPSI_SRRM2 - mPSI_CNTRL ) |>
subset( is.finite(dPSI_SRRM2) ) |>
select(EVENT, dPSI_SRRM2) |>
unique() |>
# Add the ∆PSI of the filtered events to the original data frame
left_join(y = DSE_tbl_HeLa, by = join_by(EVENT)) |>
relocate(GENE, .before = EVENT) |>
relocate(starts_with('dPSI'), .after = PSI) |>
select(GENE, EVENT, starts_with('dPSI')) |>
rename(dPSI_SRRM2_HeLa = dPSI_SRRM2) |>
unique() |>
mutate(AS_TYPE = case_when( grepl(pattern = "^HsaALTA", x = EVENT) ~ 'Alt. Acceptor',
grepl(pattern = "^HsaALTD", x = EVENT) ~ 'Alt. Donor',
grepl(pattern = "^HsaINT", x = EVENT) ~ 'Intron',
grepl(pattern = "^HsaEX", x = EVENT) ~ 'Exon') ) |>
relocate(AS_TYPE, .after = EVENT) |>
mutate(direction_SRRM2_HeLa = case_when( dPSI_SRRM2_HeLa >= dPSI_thshld ~ 'UP',
dPSI_SRRM2_HeLa <= -dPSI_thshld ~ 'DOWN',
between(dPSI_SRRM2_HeLa, left = -dPSI_thshld, right = dPSI_thshld) ~ 'NONE') ) |>
mutate(direction_SRRM2_HeLa = factor(direction_SRRM2_HeLa, levels = c('UP', 'DOWN', 'NONE'))) |>
arrange(direction_SRRM2_HeLa, desc(dPSI_SRRM2_HeLa)) -> tidy_dPSI_SRRM2_HeLaNote: these ∆PSIs are are calculated from the mean ∆PSI, similarly to how vast-tools compare would calcualte them, so the events that are found differentially spliced with the diff method are labelled as “NONE” because they are below 15 PSI.
head(subset(tidy_dPSI_SRRM2_HeLa, direction_SRRM2_HeLa == "NONE"))# A tibble: 6 × 5
GENE EVENT AS_TYPE dPSI_SRRM2_HeLa direction_SRRM2_HeLa
<chr> <chr> <chr> <dbl> <fct>
1 CUX1 HsaEX6009566 Exon 14.7 NONE
2 NAB1 HsaALTA0005536-1/2 Alt. Acceptor 14.4 NONE
3 KLHL22 HsaEX0034811 Exon 14.3 NONE
4 EML3 HsaINT0055976 Intron 14.2 NONE
5 CASP3 HsaEX0012576 Exon 14.2 NONE
6 FBXW11 HsaEX0025363 Exon 14.1 NONE
Export the annotated differentially spliced events.
write_delim(x = tidy_dPSI_SRRM2_HeLa, col_names = T, append = F, quote = 'none', escape = 'none',
file = file.path(dse_IDs_dir, '0_ALL_SHARED_UNIQUE_ALT_SPLICED_EVENTS_dPSI_HeLa.tab'),
delim = '\t')Define HepG2 samples
mrgd_cntrl <- "HepG2_CNTRL"
mrgd_SRRM2 <- "HepG2_SRRM2"
CNTRLKD <- c("HepG2_CNTRL_A", "HepG2_CNTRL_B", "HepG2_CNTRL_C")
SRRM2KD <- c("HepG2_SRRM2_Aa", "HepG2_SRRM2_Ab",
"HepG2_SRRM2_Ba", "HepG2_SRRM2_Bb",
"HepG2_SRRM2_Ca", "HepG2_SRRM2_Cb")Calculate ∆PSI in HepG2 cells.
DSE_tbl_HepG2 |>
group_by(EVENT) |>
summarise(mPSI_CNTRL = mean(PSI[Sample %in% mrgd_cntrl], na.rm = T),
mPSI_SRRM2 = mean(PSI[Sample %in% mrgd_SRRM2], na.rm = T) ) |>
# calculate ∆PSI
mutate(dPSI_SRRM2 = mPSI_SRRM2 - mPSI_CNTRL ) |>
subset( is.finite(dPSI_SRRM2) ) |>
select(EVENT, dPSI_SRRM2) |>
unique() |>
# Add the ∆PSI of the filtered events to the original data frame
left_join(y = DSE_tbl_HepG2, by = join_by(EVENT)) |>
relocate(GENE, .before = EVENT) |>
relocate(starts_with('dPSI'), .after = PSI) |>
select(GENE, EVENT, starts_with('dPSI')) |>
rename(dPSI_SRRM2_HepG2 = dPSI_SRRM2) |>
unique() |>
mutate(AS_TYPE = case_when( grepl(pattern = "^HsaALTA", x = EVENT) ~ 'Alt. Acceptor',
grepl(pattern = "^HsaALTD", x = EVENT) ~ 'Alt. Donor',
grepl(pattern = "^HsaINT", x = EVENT) ~ 'Intron',
grepl(pattern = "^HsaEX", x = EVENT) ~ 'Exon') ) |>
relocate(AS_TYPE, .after = EVENT) |>
mutate(direction_SRRM2_HepG2 = case_when( dPSI_SRRM2_HepG2 >= dPSI_thshld ~ 'UP',
dPSI_SRRM2_HepG2 <= -dPSI_thshld ~ 'DOWN',
between(dPSI_SRRM2_HepG2, left = -dPSI_thshld, right = dPSI_thshld) ~ 'NONE') ) |>
mutate(direction_SRRM2_HepG2 = factor(direction_SRRM2_HepG2, levels = c('UP', 'DOWN', 'NONE'))) |>
arrange(direction_SRRM2_HepG2, desc(dPSI_SRRM2_HepG2)) -> tidy_dPSI_SRRM2_HepG2Export the annotated differentially spliced events.
write_delim(x = tidy_dPSI_SRRM2_HepG2, col_names = T, append = F, quote = 'none', escape = 'none',
file = file.path(dse_IDs_dir, '0_ALL_SHARED_UNIQUE_ALT_SPLICED_EVENTS_dPSI_HepG2.tab'),
delim = '\t')Generate a complete supplementary table for the paper.
Supp_Tbl_SRRM2_HeLa <-
left_join(x = tidy_dPSI_SRRM2_HeLa, y = DSE_tbl_HeLa, by = join_by(EVENT, GENE) ) |>
select( !c(TYPE, Quality_Score_Type, direction_SRRM2_HeLa) ) |>
relocate(EVENT, .after = GENE) |>
relocate(AS_TYPE, .after = COMPLEX) |>
relocate(EXPERIMENT, .before = GENE) |>
relocate(dPSI_SRRM2_HeLa, .after = PSI) |>
relocate(DIRECTION, .after = Quality_Score_Value) |>
unique() |>
arrange(AS_TYPE, desc(dPSI_SRRM2_HeLa))The direction column indicates which event are up or down regulated in SRRM2-KD alone or are both up&down regulated since they are alternative splice sites.
Supp_Tbl_SRRM2_HeLa |>
select(EVENT, DIRECTION) |>
unique() |>
pull(DIRECTION) |>
table()
BOTH DOWN UP
374 543 544
Supp_Tbl_SRRM2_HeLa |>
select(EVENT, AS_TYPE) |>
unique() |>
pull(AS_TYPE) |>
table()
Alt. Acceptor Alt. Donor Exon Intron
183 191 824 262
Write to file as a tab delimited file.
write_delim(x = Supp_Tbl_SRRM2_HeLa, col_names = T, append = F, quote = 'none', escape = 'none',
file = file.path(dse_IDs_dir, '0_HeLa_Supplemtentary_Table_DSE_SHARED_UNIQUE.tab'),
delim = '\t')Now create a supplementary table for the HepG2, just in case it will be needed.
Supp_Tbl_SRRM2_HepG2 <-
left_join(x = tidy_dPSI_SRRM2_HepG2, y = DSE_tbl_HepG2, , by = join_by(EVENT, GENE) ) |>
select(!c(TYPE, Quality_Score_Type, direction_SRRM2_HepG2)) |>
relocate(EVENT, .after = GENE) |>
relocate(AS_TYPE, .after = COMPLEX) |>
relocate(EXPERIMENT, .before = GENE) |>
relocate(dPSI_SRRM2_HepG2, .after = PSI) |>
relocate(DIRECTION, .after = Quality_Score_Value) |>
unique() |>
arrange(AS_TYPE, desc(dPSI_SRRM2_HepG2))The direction column indicates which event are up or down regulated in SRRM2-KD alone or are both up&down regulated since they are alternative splice sites.
Supp_Tbl_SRRM2_HepG2 |>
select(EVENT, DIRECTION) |>
unique() |>
pull(DIRECTION) |>
table()
BOTH DOWN UP
735 1158 773
Supp_Tbl_SRRM2_HepG2 |>
select(EVENT, AS_TYPE) |>
unique() |>
pull(AS_TYPE) |>
table()
Alt. Acceptor Alt. Donor Exon Intron
381 354 991 940
Write to file as a tab delimited file.
write_delim(x = Supp_Tbl_SRRM2_HepG2, col_names = T, append = F, quote = 'none', escape = 'none',
file = file.path(dse_IDs_dir, '0_HepG2_Supplemtentary_Table_DSE_SHARED_UNIQUE.tab'),
delim = '\t')The interesting observation is that there’s a little overlap between between HeLa and HepG2 SRRM2 KD.
Further exploration would be required to understand if this is a cell-line specific effect of a different batch issue. Analyzing SRRM2-KD in other cellular context could shed some light on this observation.
sessioninfo::session_info()─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.4.2 (2024-10-31)
os AlmaLinux 9.5 (Teal Serval)
system x86_64, linux-gnu
ui X11
language en_US.UTF-8
collate en_US.UTF-8
ctype en_US.UTF-8
tz Europe/Madrid
date 2025-03-03
pandoc 3.4 @ /users/mirimia/narecco/software/miniconda/envs/analysis/bin/ (via rmarkdown)
quarto 1.6.41 @ /users/mirimia/narecco/software/miniconda/envs/analysis/bin/quarto
─ Packages ───────────────────────────────────────────────────────────────────
! package * version date (UTC) lib source
abind 1.4-5 2016-07-21 [1] CRAN (R 4.4.1)
ade4 1.7-23 2025-02-14 [1] CRAN (R 4.4.2)
AnnotationDbi 1.68.0 2024-10-29 [1] Bioconductor 3.20 (R 4.4.2)
backports 1.5.0 2024-05-23 [1] CRAN (R 4.4.1)
Biobase 2.66.0 2024-10-29 [1] Bioconductor 3.20 (R 4.4.2)
BiocFileCache 2.14.0 2024-10-29 [1] Bioconductor 3.20 (R 4.4.2)
BiocGenerics 0.52.0 2024-10-29 [1] Bioconductor 3.20 (R 4.4.2)
BiocParallel 1.40.0 2024-10-29 [1] Bioconductor 3.20 (R 4.4.2)
biomaRt 2.62.0 2024-10-29 [1] Bioconductor 3.20 (R 4.4.2)
Biostrings 2.74.0 2024-10-29 [1] Bioconductor 3.20 (R 4.4.2)
bit 4.5.0.1 2024-12-03 [1] CRAN (R 4.4.2)
bit64 4.5.2 2024-09-22 [1] CRAN (R 4.4.1)
bitops 1.0-9 2024-10-03 [1] CRAN (R 4.4.1)
blob 1.2.4 2023-03-17 [1] CRAN (R 4.4.1)
broom 1.0.7 2024-09-26 [1] CRAN (R 4.4.1)
cachem 1.1.0 2024-05-16 [1] CRAN (R 4.4.1)
Cairo * 1.6-2 2023-11-28 [1] CRAN (R 4.4.1)
car 3.1-3 2024-09-27 [1] CRAN (R 4.4.1)
carData 3.0-5 2022-01-06 [1] CRAN (R 4.4.1)
cli 3.6.4 2025-02-13 [1] CRAN (R 4.4.2)
codetools 0.2-20 2024-03-31 [1] CRAN (R 4.4.1)
colorspace 2.1-1 2024-07-26 [1] CRAN (R 4.4.1)
crayon 1.5.3 2024-06-20 [1] CRAN (R 4.4.1)
csaw 1.40.0 2024-10-29 [1] Bioconductor 3.20 (R 4.4.2)
curl 6.0.1 2024-11-14 [1] CRAN (R 4.4.2)
DBI 1.2.3 2024-06-02 [1] CRAN (R 4.4.1)
dbplyr 2.5.0 2024-03-19 [1] CRAN (R 4.4.1)
DelayedArray 0.32.0 2024-10-29 [1] Bioconductor 3.20 (R 4.4.2)
desc 1.4.3 2023-12-10 [1] CRAN (R 4.4.1)
DESeq2 1.46.0 2024-10-29 [1] Bioconductor 3.20 (R 4.4.2)
devtools 2.4.5 2022-10-11 [1] CRAN (R 4.4.1)
digest 0.6.37 2024-08-19 [1] CRAN (R 4.4.1)
dplyr 1.1.4 2023-11-17 [1] CRAN (R 4.4.1)
edgeR 4.4.0 2024-10-29 [1] Bioconductor 3.20 (R 4.4.2)
ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.4.1)
evaluate 1.0.3 2025-01-10 [1] CRAN (R 4.4.2)
farver 2.1.2 2024-05-13 [1] CRAN (R 4.4.1)
fastmap 1.2.0 2024-05-15 [1] CRAN (R 4.4.1)
filelock 1.0.3 2023-12-11 [1] CRAN (R 4.4.1)
forcats 1.0.0 2023-01-29 [1] CRAN (R 4.4.1)
foreign 0.8-88 2025-01-12 [1] CRAN (R 4.4.2)
Formula 1.2-5 2023-02-24 [1] CRAN (R 4.4.1)
fs 1.6.5 2024-10-30 [1] CRAN (R 4.4.1)
generics 0.1.3 2022-07-05 [1] CRAN (R 4.4.1)
GenomeInfoDb 1.42.0 2024-10-29 [1] Bioconductor 3.20 (R 4.4.2)
GenomeInfoDbData 1.2.13 2025-03-03 [1] Bioconductor
GenomicRanges 1.58.0 2024-10-29 [1] Bioconductor 3.20 (R 4.4.2)
ggalluvial 0.12.5 2023-02-22 [1] CRAN (R 4.4.1)
ggfittext 0.10.2 2024-02-01 [1] CRAN (R 4.4.1)
ggforce * 0.4.2 2024-02-19 [1] CRAN (R 4.4.1)
ggplot2 * 3.5.1 2024-04-23 [1] CRAN (R 4.4.1)
ggrepel * 0.9.6 2024-09-07 [1] CRAN (R 4.4.1)
ggseqlogo 0.2 2024-02-09 [1] CRAN (R 4.4.1)
glue 1.8.0 2024-09-30 [1] CRAN (R 4.4.1)
gridExtra 2.3 2017-09-09 [1] CRAN (R 4.4.1)
gtable 0.3.6 2024-10-25 [1] CRAN (R 4.4.1)
gtools 3.9.5 2023-11-20 [1] CRAN (R 4.4.1)
hms 1.1.3 2023-03-21 [1] CRAN (R 4.4.1)
htmltools 0.5.8.1 2024-04-04 [1] CRAN (R 4.4.1)
htmlwidgets 1.6.4 2023-12-06 [1] CRAN (R 4.4.1)
httpuv 1.6.15 2024-03-26 [1] CRAN (R 4.4.1)
httr 1.4.7 2023-08-15 [1] CRAN (R 4.4.1)
httr2 1.1.0 2025-01-18 [1] CRAN (R 4.4.2)
IRanges 2.40.0 2024-10-29 [1] Bioconductor 3.20 (R 4.4.2)
jsonlite 1.9.0 2025-02-19 [1] CRAN (R 4.4.2)
KEGGREST 1.46.0 2024-10-29 [1] Bioconductor 3.20 (R 4.4.2)
knitr 1.49 2024-11-08 [1] CRAN (R 4.4.1)
labeling 0.4.3 2023-08-29 [1] CRAN (R 4.4.1)
later 1.4.1 2024-11-27 [1] CRAN (R 4.4.2)
lattice 0.22-6 2024-03-20 [1] CRAN (R 4.4.1)
lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.4.1)
limma 3.62.1 2024-11-03 [1] Bioconductor 3.20 (R 4.4.2)
locfit 1.5-9.11 2025-02-03 [1] CRAN (R 4.4.2)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.4.1)
MASS 7.3-64 2025-01-04 [1] CRAN (R 4.4.2)
Matrix 1.7-2 2025-01-23 [1] CRAN (R 4.4.2)
MatrixGenerics 1.18.0 2024-10-29 [1] Bioconductor 3.20 (R 4.4.2)
matrixStats 1.5.0 2025-01-07 [1] CRAN (R 4.4.2)
memoise 2.0.1 2021-11-26 [1] CRAN (R 4.4.1)
metapod 1.14.0 2024-10-29 [1] Bioconductor 3.20 (R 4.4.2)
MetBrewer 0.2.0 2022-03-21 [1] CRAN (R 4.4.1)
mime 0.12 2021-09-28 [1] CRAN (R 4.4.1)
miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.4.1)
msa 1.38.0 2024-10-29 [1] Bioconductor 3.20 (R 4.4.2)
munsell 0.5.1 2024-04-01 [1] CRAN (R 4.4.1)
R niar * 0.3.0 <NA> [?] <NA>
patchwork 1.3.0 2024-09-16 [1] CRAN (R 4.4.1)
pheatmap * 1.0.12 2019-01-04 [1] CRAN (R 4.4.1)
pillar 1.10.1 2025-01-07 [1] CRAN (R 4.4.2)
pkgbuild 1.4.6 2025-01-16 [1] CRAN (R 4.4.2)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.4.1)
pkgload 1.4.0 2024-06-28 [1] CRAN (R 4.4.1)
plyr 1.8.9 2023-10-02 [1] CRAN (R 4.4.1)
png 0.1-8 2022-11-29 [1] CRAN (R 4.4.1)
polyclip 1.10-7 2024-07-23 [1] CRAN (R 4.4.1)
prettyunits 1.2.0 2023-09-24 [1] CRAN (R 4.4.1)
profvis 0.4.0 2024-09-20 [1] CRAN (R 4.4.1)
progress 1.2.3 2023-12-06 [1] CRAN (R 4.4.1)
promises 1.3.2 2024-11-28 [1] CRAN (R 4.4.2)
psychTools 2.4.2 2024-01-18 [1] CRAN (R 4.4.1)
purrr 1.0.4 2025-02-05 [1] CRAN (R 4.4.2)
R6 2.6.1 2025-02-15 [1] CRAN (R 4.4.2)
rappdirs 0.3.3 2021-01-31 [1] CRAN (R 4.4.1)
RColorBrewer 1.1-3 2022-04-03 [1] CRAN (R 4.4.1)
Rcpp 1.0.14 2025-01-12 [1] CRAN (R 4.4.2)
readr 2.1.5 2024-01-10 [1] CRAN (R 4.4.1)
remotes 2.5.0 2024-03-17 [1] CRAN (R 4.4.1)
rlang 1.1.5 2025-01-17 [1] CRAN (R 4.4.2)
rmarkdown 2.29 2024-11-04 [1] CRAN (R 4.4.1)
rprojroot 2.0.4 2023-11-05 [1] CRAN (R 4.4.1)
Rsamtools 2.22.0 2024-10-29 [1] Bioconductor 3.20 (R 4.4.2)
RSQLite 2.3.9 2024-12-03 [1] CRAN (R 4.4.2)
rstatix * 0.7.2 2023-02-01 [1] CRAN (R 4.4.1)
rstudioapi 0.17.1 2024-10-22 [1] CRAN (R 4.4.1)
S4Arrays 1.6.0 2024-10-29 [1] Bioconductor 3.20 (R 4.4.2)
S4Vectors 0.44.0 2024-10-29 [1] Bioconductor 3.20 (R 4.4.2)
scales 1.3.0 2023-11-28 [1] CRAN (R 4.4.1)
seqinr 4.2-36 2023-12-08 [1] CRAN (R 4.4.1)
sessioninfo 1.2.3 2025-02-05 [1] CRAN (R 4.4.2)
shiny 1.10.0 2024-12-14 [1] CRAN (R 4.4.2)
SparseArray 1.6.0 2024-10-29 [1] Bioconductor 3.20 (R 4.4.2)
statmod 1.5.0 2023-01-06 [1] CRAN (R 4.4.1)
stringi 1.8.4 2024-05-06 [1] CRAN (R 4.4.1)
stringr 1.5.1 2023-11-14 [1] CRAN (R 4.4.1)
SummarizedExperiment 1.36.0 2024-10-29 [1] Bioconductor 3.20 (R 4.4.2)
tibble 3.2.1 2023-03-20 [1] CRAN (R 4.4.1)
tidyr 1.3.1 2024-01-24 [1] CRAN (R 4.4.1)
tidyselect 1.2.1 2024-03-11 [1] CRAN (R 4.4.1)
tweenr 2.0.3 2024-02-26 [1] CRAN (R 4.4.1)
tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.4.1)
UCSC.utils 1.2.0 2024-10-29 [1] Bioconductor 3.20 (R 4.4.2)
UpSetR * 1.4.0 2019-05-22 [1] CRAN (R 4.4.1)
urlchecker 1.0.1 2021-11-30 [1] CRAN (R 4.4.1)
usethis 3.1.0 2024-11-26 [1] CRAN (R 4.4.2)
utf8 1.2.4 2023-10-22 [1] CRAN (R 4.4.1)
vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.4.1)
viridis * 0.6.5 2024-01-29 [1] CRAN (R 4.4.1)
viridisLite * 0.4.2 2023-05-02 [1] CRAN (R 4.4.1)
vroom 1.6.5 2023-12-05 [1] CRAN (R 4.4.1)
withr 3.0.2 2024-10-28 [1] CRAN (R 4.4.1)
xfun 0.51 2025-02-19 [1] CRAN (R 4.4.2)
XICOR 0.4.1 2023-04-21 [1] CRAN (R 4.4.1)
xml2 1.3.6 2023-12-04 [1] CRAN (R 4.4.1)
xtable 1.8-4 2019-04-21 [1] CRAN (R 4.4.1)
XVector 0.46.0 2024-10-29 [1] Bioconductor 3.20 (R 4.4.2)
yaml 2.3.10 2024-07-26 [1] CRAN (R 4.4.1)
zlibbioc 1.52.0 2024-10-29 [1] Bioconductor 3.20 (R 4.4.2)
[1] /users/mirimia/narecco/software/miniconda/envs/analysis/lib/R/library
* ── Packages attached to the search path.
R ── Package was removed from disk.
──────────────────────────────────────────────────────────────────────────────
This document is generated using Quarto which enables to weave together content and executable code into a finished document. This is an improved version of what used to be called ‘Rmarkdown’. The document hides all code chunks, but they can be opened up with the drop down arrow. On the top right corner there’s a toggle for dark-mode. To learn more about Quarto see here.